815 research outputs found

    PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data.

    Microbial diversity is typically characterized by clustering small-subunit ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity.

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
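
    As a hedged illustration of the two patterns the abstract names, sorting and hashing, the sketch below counts k-mers in two ways. It is not taken from the paper: the function names, the toy reads, and the choice of k-mer counting as the example task are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import groupby


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_by_hashing(reads, k):
    """Count k-mers with a hash table (Counter is a dict subclass)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return dict(counts)


def count_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal keys."""
    all_kmers = sorted(km for read in reads for km in kmers(read, k))
    return {km: sum(1 for _ in run) for km, run in groupby(all_kmers)}


# Toy reads invented for this sketch, not real sequencing data.
reads = ["ACGTAC", "GTACGT"]
hashed_counts = count_by_hashing(reads, 3)
sorted_counts = count_by_sorting(reads, 3)
```

    Both functions return the same table. The hashing version performs many small, irregular updates to a shared structure, while the sorting version trades those updates for one bulk reordering step, which is the kind of trade-off that matters when choosing a parallelization strategy.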

    A probabilistic model to recover individual genomes from metagenomes


    Widespread recombination, reassortment, and transmission of unbalanced compound viral genotypes in natural arenavirus infections.

    Arenaviruses are one of the largest families of human hemorrhagic fever viruses and are known to infect both mammals and snakes. Arenaviruses package a large (L) and small (S) genome segment in their virions. For segmented RNA viruses like these, novel genotypes can be generated through mutation, recombination, and reassortment. Although it is believed that an ancient recombination event led to the emergence of a new lineage of mammalian arenaviruses, neither recombination nor reassortment has been definitively documented in natural arenavirus infections. Here, we used metagenomic sequencing to survey the viral diversity present in captive arenavirus-infected snakes. From 48 infected animals, we determined the complete or near complete sequence of 210 genome segments that grouped into 23 L and 11 S genotypes. The majority of snakes were multiply infected, with up to 4 distinct S and 11 distinct L segment genotypes in individual animals. This S/L imbalance was typical: in all cases intrahost L segment genotypes outnumbered S genotypes, and a particular S segment genotype dominated in individual animals and at a population level. We corroborated sequencing results by qRT-PCR and virus isolation, and isolates replicated as ensembles in culture. Numerous instances of recombination and reassortment were detected, including recombinant segments with unusual organizations featuring 2 intergenic regions and superfluous content, which were capable of stable replication and transmission despite their atypical structures. Overall, this represents intrahost diversity of an extent and form that goes well beyond what has been observed for arenaviruses or for viruses in general. This diversity can be plausibly attributed to the captive intermingling of sub-clinically infected wild-caught snakes. 
    Thus, beyond providing a unique opportunity to study arenavirus evolution and adaptation, these findings allow the investigation of unintended anthropogenic impacts on viral ecology, diversity, and disease potential.

    Targeted Computational Approaches for Mining Functional Elements in Metagenomes

    Thesis (Ph.D.) - Indiana University, Informatics, 2012
    Metagenomics enables the genomic study of uncultured microorganisms by directly extracting the genetic material from microbial communities for sequencing. Fueled by the rapid development of Next Generation Sequencing (NGS) technology, metagenomics research has been revolutionizing the field of microbiology, revealing the taxonomic and functional composition of many microbial communities and their impacts on almost every aspect of life on Earth. Analyzing metagenomes (a metagenome is the collection of genomic sequences of an entire microbial community) is challenging: metagenomic sequences are often extremely short and therefore lack the genomic context needed for annotating functional elements, while whole-metagenome assemblies are often poor because a metagenomic dataset contains reads from many different species. Novel computational approaches are still needed to get the most out of metagenomes. In this dissertation, I first developed a binning algorithm (AbundanceBin) for clustering metagenomic sequences into groups, each containing sequences from species of similar abundances. AbundanceBin provides accurate estimates of the abundances of the species in a microbial community and of their genome sizes. Applying AbundanceBin prior to assembly results in better assemblies of metagenomes, an outcome crucial to downstream analyses of metagenomic datasets. In addition, I designed three targeted computational approaches for assembling and annotating protein-coding genes and other functional elements from metagenomic sequences. GeneStitch is an approach for gene assembly that connects gene fragments scattered across different contigs into longer genes with the guidance of reference genes.
    I also developed two specialized assembly methods: the targeted-assembly method for assembling CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats), and the constrained-assembly method for retrieving chromosomal integrons. Applications of these methods to the Human Microbiome Project (HMP) datasets show that human microbiomes are extremely dynamic, reflecting the interactions between community members (including bacteria and viruses).

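    Use of Liquid Array Diagnostics (LAD) as a tool for analysing the composition and function of the gut microbiota (original title: Bruk av Liquid Array Diagnostics (LAD) som verktøy for analyse av sammensetning og funksjon av tarmens mikrobiota)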

    The microbial species residing in the human gut perform vital functions for the host. They produce different metabolites that are crucial for human wellbeing. A variety of such molecules mediate signalling along the gut-brain axis, regulate host gene expression, develop and maintain the intestinal and blood-brain barriers, and are involved in lipogenesis and gluconeogenesis, in addition to taking part in a wide range of other functions. A deviation in the composition of the intestinal flora is mechanistically linked to various health disorders, including inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), type 2 diabetes, and Parkinson’s and Alzheimer’s disease. Such a deviation, known as dysbiosis, represents an unbalanced composition in which certain microbial groups are promoted at the expense of others. These species are considered promising biomarkers, valuable for disease diagnosis, monitoring and treatment. Of particular interest are markers that can additionally reveal phenotypic characteristics, such as the overall level of short-chain fatty acids (SCFA) in human gut samples. The prospect of discovering additional markers is high, considering that the content of healthy human guts worldwide is not fully characterized. The field of gut microbiota research is shifting its focus to clinically relevant species, and particularly to their rapid detection, as a means of offering simple diagnostic solutions with increased availability and accessibility. This makes it possible to put biological findings to practical clinical use, which is often not feasible with current species identification platforms. To fill this need, the main aim of this thesis was to develop a targeted approach for rapid gut microbiota testing based on the novel Liquid Array Diagnostics (LAD) technology. LAD is designed to target 16S rRNA gene sites unique to specific microbial groups.
    Requiring only commonplace qPCR instrumentation, it can detect up to 30 distinct microbial markers in a single-tube multiplex reaction within a working day. LAD’s utility in microbiome studies was validated by testing the prevalence and abundance of 15 microbial markers in 541 samples collected from mothers and their children, as reported in Paper I. Paper II describes a comprehensive human gut prokaryotic genome collection, HumGut. It was built by screening thousands of human gut metagenome samples, collected from healthy people worldwide, for the presence of any high-quality, publicly available prokaryotic genome. The main rationale for creating it was to enable functional studies through LAD-based 16S targeting. HumGut, as a reference database, was shown to aid whole-genome sequencing studies by significantly increasing the number of mapped sequencing reads, thus raising the potential for improved taxonomic classification. However, as it stands, HumGut is of limited practical use for 16S rRNA gene-targeted approaches like LAD. This is because most of the representative genomes either lack this gene or contain 16S sequences of compromised quality (addressed in Paper III). Nonetheless, LAD was used to infer a segment of human gut microbiota functionality by targeting the 16S rRNA gene. This was done using data from 16S rDNA sequencing and short-chain fatty acid (SCFA) measurements. LAD’s value in classifying samples with disturbed SCFA ratios (namely a high propionate-to-butyrate ratio), an indication of functional dysbiosis, is presented in Paper IV. Taken together, this thesis introduces two tools, LAD and HumGut, both pointing in the direction of simplified human gut functional analysis via detection of gut microbial composition.

    Use of Whole Genome Shotgun Sequencing for the Analysis of Microbial Communities in Arabidopsis thaliana Leaves

    Microorganisms, which include all Bacteria and Archaea and some Eukaryotes, inhabit all imaginable habitats on the planet, from water vents in the deep ocean to extreme environments of high temperature and salinity. Microbes also constitute the most diverse group of organisms in terms of genetic information, metabolic function, and taxonomy. Furthermore, many of these microbes establish complex interactions with each other and with many multicellular organisms. The collection of microbes that share a body space with a plant or animal is called the microbiota, and their genetic information is called the microbiome. The microbiota has emerged as a crucial determinant of a host’s overall health, and understanding it has become important in many biological fields. In mammals, the gut microbiota has been linked to important diseases such as diabetes, inflammatory bowel disease, and dementia. In plants, the microbiota can provide protection against certain pathogens or confer resistance to harsh environmental conditions such as drought. Furthermore, the leaves of plants represent one of the largest surface areas that can potentially be colonized by microbes. The advent of sequencing technologies has allowed researchers to study microbial communities at unprecedented resolution and scale. By targeting individual loci such as the 16S rDNA locus in bacteria, many species can be studied simultaneously, along with properties such as relative abundance, without the need to isolate the target taxa individually. Decreasing costs of DNA sequencing have also led to whole-genome shotgun sequencing, where instead of targeting one or a few loci, random fragments of DNA are sequenced. This approach, referred to as metagenomics, effectively renders the entire microbiome accessible to study. Consequently, many more areas of investigation are open, such as the exploration of within-host genetic diversity, functional analysis, or assembly of individual genomes from metagenomes.
    In this study, I describe the analysis of metagenomic sequencing data from microbial communities in leaves of wild Arabidopsis thaliana individuals from southwest Germany. As a model organism, A. thaliana is not only accessible in the wild but also has a rich body of previous research on plant-microbe interactions. In the first section, I describe how whole-genome shotgun sequencing of leaf DNA extracts can be used to accurately describe the taxonomic composition of the microbial community of individual hosts. The untargeted nature of shotgun sequencing is used to estimate true microbial abundances, which cannot be done with amplicon sequencing. I show that this community varies across hosts, although some trends emerge, such as the dominance of the bacterial genera Pseudomonas and Sphingomonas. Despite the variation between individuals, I also explore the influence of site of origin and host genotype. Finally, metagenomic assembly is applied to individual samples, showing the limitations of WGS in plant leaves. In the second section, I explore the genomic diversity of the most abundant genera, Pseudomonas and Sphingomonas. I use a core genome approach in which a set of common genes is obtained from previously sequenced and assembled genomes. The gene sequences of the core genome are then used as a reference for short-read mapping. Based on these mappings, individual strain mixtures are inferred from the frequency distribution of non-reference bases at each detected single nucleotide polymorphism (SNP). Finally, SNPs are used to derive the population structure of strain mixtures across samples and relative to known reference genomes. In conclusion, this thesis provides insights into the use of metagenomic sequencing to study microbial populations in wild plants. I identify the strengths and weaknesses of using whole-genome sequencing for this purpose, as well as a way to study strain-level dynamics of prevalent taxa within a single host.
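
    The abstract mentions inferring strain mixtures from the frequency of non-reference bases at SNP sites. A minimal sketch of that first step, computing per-site non-reference frequencies from base counts, might look as follows. The data layout, the function name `nonref_frequencies`, the `min_depth` parameter, and the toy counts are all assumptions made for illustration; the thesis's actual pipeline is not described in the abstract.

```python
def nonref_frequencies(pileup, reference, min_depth=1):
    """Return {position: fraction of non-reference bases} for variable sites.

    pileup    -- {position: {base: read count}} (hypothetical layout)
    reference -- {position: reference base}
    min_depth -- skip sites covered by fewer reads than this
    """
    freqs = {}
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        nonref = depth - counts.get(reference[pos], 0)
        if nonref > 0:  # keep only sites where some read disagrees
            freqs[pos] = nonref / depth
    return freqs


# Toy base counts invented for this sketch.
pileup = {
    10: {"A": 18, "G": 6},   # 6 of 24 reads differ from the reference
    25: {"C": 20},           # no variation, so the site is dropped
    40: {"T": 12, "C": 12},  # an even mixture
}
reference = {10: "A", 25: "C", 40: "T"}
freqs = nonref_frequencies(pileup, reference)
```

    In a mixture of strains, sites that differ between strains tend to share similar non-reference frequencies, so the distribution of these values across sites carries information about how many strains are present and in what proportions.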

    Space-efficient clustering of metagenomic DNA fragments (original title: Tilatehokas metagenomisten DNA-fragmenttien ryhmittely)

    The collection of all genomes in an environment is called the metagenome of the environment. In the past 15 years, high-throughput sequencing has made it feasible for the first time to sequence entire environments at once, which has given rise to a variety of interesting new algorithmic problems. This thesis focuses on the basic problem of clustering the reads from an environment according to the species, or more generally the taxonomic unit, they originate from. In this work, we identify and formalize two fundamental string processing tasks useful in clustering metagenomic read sets. We solve the two problems with space efficiency in mind, using the recently developed bidirectional Burrows-Wheeler index. The algorithms were implemented in a way that makes parallel processing possible. Our tool is experimentally shown to give good results for simple simulated datasets, and to use 10 times less space and time than two recently published metagenome clustering tools.
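
    For readers unfamiliar with the Burrows-Wheeler index mentioned in the abstract, the following is a simplified sketch of the underlying idea: build the Burrows-Wheeler transform of a text and count pattern occurrences by backward search. It is a plain, unidirectional, deliberately naive version, not the bidirectional, space-efficient index the thesis uses, and the function names and example string are assumptions made for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    Quadratic in time and space; for illustration only.
    """
    text += "$"  # sentinel, assumed absent from text and smaller than its symbols
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # smaller[c] = number of symbols in the text strictly smaller than c
    smaller, total = {}, 0
    for c in sorted(set(bwt_text)):
        smaller[c] = total
        total += bwt_text.count(c)

    def rank(c, i):
        """Occurrences of c in bwt_text[:i] (a real index precomputes this)."""
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + rank(c, lo)
        hi = smaller[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("GATTACAGATTA")  # toy sequence invented for this sketch
hits = count_occurrences(index, "ATTA")
```

    The search touches only the transformed string and a small table of symbol counts, which is why indexes built on this idea can stay compact. A bidirectional index additionally allows a match to be extended in both directions.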