876 research outputs found

    A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community

    The wide application of next-generation sequencing (NGS) technologies in metagenomics has raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes, followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), so it is often difficult to determine which species or strain a specific read was sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variation is observed among individual genomes within the same species, and it is difficult to differentiate from inter-species variation during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. 
Using simulated and real metagenomic datasets, we show that the de Bruijn graph approach has several advantages over existing methods: (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification of closely-related species (and even of strains within the same species).
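
The idea of collapsing shared regions can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a simplified graph in which each k-mer is a node labeled with the genomes that contain it, and it assumes that a read compatible with several genomes is split evenly among them. The function names (`build_colored_graph`, `quantify`) and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_colored_graph(genomes, k):
    """Map each k-mer (graph node) to the set of genomes containing it.

    A k-mer shared by several genomes is stored once, so common
    regions are collapsed rather than duplicated per genome.
    """
    node_to_genomes = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            node_to_genomes[km].add(name)
    return node_to_genomes


def quantify(reads, graph, k):
    """Assign each read to the genomes compatible with all of its k-mers.

    Assumption for this sketch: an ambiguous read contributes an equal
    fraction to every compatible genome.
    """
    counts = defaultdict(float)
    for read in reads:
        candidates = None
        for km in kmers(read, k):
            owners = graph.get(km)
            if not owners:          # k-mer absent from every genome
                candidates = set()
                break
            candidates = set(owners) if candidates is None else candidates & owners
        if candidates:
            for genome in candidates:
                counts[genome] += 1.0 / len(candidates)
    return dict(counts)


# Toy data: the two genomes share the prefix "AAACCC".
genomes = {"A": "AAACCCGGGT", "B": "AAACCCTTTG"}
graph = build_colored_graph(genomes, k=4)
reads = ["AAACCC", "CCGGGT", "CCTTTG", "CCCTTT"]
counts = quantify(reads, graph, k=4)
```

In this toy case the read from the shared region is counted once and divided between the two genomes, instead of being mapped separately to each copy of the region.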

    MetaGT : A pipeline for de novo assembly of metatranscriptomes with the aid of metagenomic data

    While metagenome sequencing may provide insights into the genome sequences and composition of microbial communities, metatranscriptome analysis can be useful for studying the functional activity of a microbiome. RNA-Seq data make it possible to determine which genes are active in the community and how their expression levels depend on external conditions. Although the field of metatranscriptomics is relatively young, the number of projects related to metatranscriptome analysis increases every year and the scope of its applications expands. However, several problems complicate metatranscriptome analysis: the complexity of microbial communities, the wide dynamic range of transcript expression and, importantly, the lack of high-quality computational methods for assembling meta-RNA sequencing data. These factors reduce the contiguity and completeness of metatranscriptome assemblies, thereby affecting downstream analysis. Here we present MetaGT, a pipeline for de novo assembly of metatranscriptomes, which is based on the idea of combining metatranscriptomic and metagenomic data sequenced from the same sample. MetaGT assembles metatranscriptomic contigs and fills in missing regions based on their alignments to the metagenome assembly. This approach makes it possible to overcome the complexities described above and obtain complete RNA sequences, and additionally to estimate their abundances. Using various publicly available real and simulated datasets, we demonstrate that MetaGT yields a significant improvement in coverage and completeness of metatranscriptome assemblies compared to existing methods that do not exploit metagenomic data. The pipeline is implemented in Nextflow and is freely available from https://github.com/ablab/metaGT. Peer reviewed

    EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data

    Recovery of ribosomal small subunit genes by assembly of short read community DNA sequence data generally fails, making taxonomic characterization difficult. Here, we solve this problem with a novel iterative method, based on the expectation maximization algorithm, that reconstructs full-length small subunit gene sequences and provides estimates of relative taxon abundances. We apply the method to natural and simulated microbial communities, and correctly recover community structure from known and previously unreported rRNA gene sequences. An implementation of the method is freely available at https://github.com/csmiller/EMIRGE
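
The abundance-estimation side of an expectation maximization approach can be sketched as follows. This is a generic mixture-proportion EM and not the EMIRGE implementation; in particular, it leaves out the iterative reconstruction of the gene sequences themselves. The function name `em_abundances` and the input layout (one row of per-reference likelihoods for each read) are assumptions made for illustration.

```python
def em_abundances(likelihoods, n_iter=100):
    """Estimate relative abundances of reference sequences with EM.

    likelihoods[i][j] is an assumed, precomputed likelihood that read i
    originates from reference j (for example, derived from an alignment).
    """
    n_refs = len(likelihoods[0])
    abund = [1.0 / n_refs] * n_refs          # start from uniform abundances
    for _ in range(n_iter):
        totals = [0.0] * n_refs
        # E-step: share each read among references in proportion to
        # (current abundance * read likelihood).
        for row in likelihoods:
            weighted = [a * l for a, l in zip(abund, row)]
            norm = sum(weighted)
            if norm == 0:                     # read matches no reference
                continue
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        # M-step: new abundances are the normalized expected read counts.
        total = sum(totals)
        if total == 0:
            return abund
        abund = [t / total for t in totals]
    return abund


# Two reads unique to reference 0, one unique to reference 1,
# and one read that matches both equally well.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
estimate = em_abundances(toy)
```

For this toy input the estimate converges to roughly two thirds for the first reference and one third for the second, because the ambiguous read is shared in proportion to the abundances supported by the unique reads.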

    Bioinformatics tools for analysing viral genomic data

    The field of viral genomics and bioinformatics is experiencing a strong resurgence due to high-throughput sequencing (HTS) technology, which enables the rapid and cost-effective sequencing and subsequent assembly of large numbers of viral genomes. In addition, the unprecedented power of HTS technologies has enabled the analysis of intra-host viral diversity and quasispecies dynamics in relation to important biological questions on viral transmission, vaccine resistance and host jumping. HTS also enables the rapid identification of both known and potentially new viruses from field and clinical samples, thus adding new tools to the fields of viral discovery and metagenomics. Bioinformatics has been central to the rise of HTS applications because new algorithms and software tools are continually needed to process and analyse the large, complex datasets generated in this rapidly evolving area. In this paper, the authors give a brief overview of the main bioinformatics tools available for viral genomic research, with a particular emphasis on HTS technologies and their main applications. They summarise the major steps in various HTS analyses, starting with quality control of raw reads and encompassing activities ranging from consensus and de novo genome assembly to variant calling and metagenomics, as well as RNA sequencing

    Whole genome sequencing of a bacterium and a yeast isolated from the intestine of Atlantic salmon

    Master's thesis in genomics - Nord University, 2020. Embargoed until 2023-09-0

    Back to basics: simplifying microbial communities to interpret complex interactions (Tilbake til det grunnleggende : forenkling av mikrobielle samfunn for å tolke komplekse interaksjoner)

    Microbes are everywhere and contribute to many essential processes relevant for planet Earth, ranging from biogeochemical cycles to complex human behavior. The means by which such small and, at first glance, simple organisms achieve these colossal tasks rely on their ability to assemble in heterogeneous communities in which populations with different taxonomies and functions coexist and complement each other. Some microbes are of particular interest for human civilization and have long been used for everyday tasks, such as the production of bread and wine. More recently, large-scale industrial and civil projects have taken advantage of the transformative capabilities of microbial communities, with key examples being biogas reactors, mining and wastewater treatment. Decades of classical microbiology, based on pure culture isolates and their physiological characterization, have built the foundations of modern microbial ecology. Molecular analysis of microbes and microbial communities has generated an understanding that for many microbial populations cultivation is hard to achieve and that breaking a community apart impacts its function. These limitations have driven the development of technical tools that bring us directly in contact with communities in their natural environment. In the mid-2000s the recently established “omics” techniques were quickly adapted to their “meta-omics” version, enabling direct analysis of microbial samples without culture. Every class of molecules (DNA, RNA, protein, metabolite, etc.) can now theoretically be analyzed from the entire community within a given sample. Metagenomics uses community DNA to build the phylogenetic picture and the genetic potential, whereas metatranscriptomics and metaproteomics employ RNA and proteins, respectively, to probe the gene expression of the community. Finally, meta-metabolomics can close the loop and describe the metabolic activity of the microbes. 
Here, we combined the four aforementioned major meta-omics disciplines in a gene- and population-centric perspective to revisit the Aristotelian question underlying microbial ecology: how is it possible that the whole is more than the sum of its parts? Along with the detailed answers provided by the individual communities in various environments, we also tried to learn something about biology itself. We first addressed, in a saccharolytic and methane-producing minimalistic consortium (SEM1b), the strain-specific interplay involved in (hemi)cellulose degradation, explaining the ubiquity of Coprothermobacter proteolyticus in biogas reactors. We showed, through the genetic potential of the C. proteolyticus-affiliated COPR1 population, the putative acquisition via horizontal gene transfer of a gene cassette for hemicellulose degradation. Moreover, we showed that the expression of these COPR1 genes was both coherent with the release of hemicellulose by another population of the community (RCLO1) and synchronized with the expression of the orthologous genes of an already known hemicellulolytic population (CLOS1). Lastly, we demonstrated that the purified COPR1 protein (Glycosyl Hydrolase 16) showed endoglucanase activity on several hemicellulose substrates. Secondly, we explored the combined application of absolute omics-based quantification of RNA and proteins using SEM1b as a benchmark community, due to its lower complexity (fewer than 12 populations) and relatively resolved biology. We subsequently demonstrated that the uncultured bacterial populations in SEM1b followed the expected protein-to-RNA ratio (10^2 to 10^4) of previously analyzed cultured bacteria in exponential phase. In contrast, an archaeon population from SEM1b showed values in the range 10^3 to 10^5, the same as what has been reported for eukaryotes (yeast and human) in the literature. 
In addition, we modeled the linearity (k) between genome-centric transcriptomes and proteomes over time and used it to predict the essential metabolic populations of the SEM1b community through converging and parallel k-trends, which was subsequently confirmed via classical pathway analysis. Finally, we estimated the translation and protein degradation rates, coming to the conclusion that some of the processes in the cell that require rapid tuning (e.g. metabolism and motility) are (also) regulated post-transcriptionally. Thirdly, we sought to apply our approach of collapsing complex datasets into simple metrics, in order to identify underlying community trends, to a more complex, “real-world” microbiome. To do this, we resolved more than one year of weekly sampling from a lipid-accumulating community (Shif-LAO) that inhabits a wastewater treatment plant in Shifflange (Luxembourg), and showed an extreme genetic redundancy and turnover in contrast to a more conservative trend in functions. Moreover, we demonstrated how the time patterns (e.g. seasonality) in both gene count and gene expression are linked with the physico-chemical parameters associated with the corresponding samples. Furthermore, we built the static reaction network underlying the whole community over the complete dataset (51 temporal samples). From this, we characterized the sub-network for lipid accumulation, and showed that its more highly expressed nodes were defined by resource competition between different taxa (deduced via inverse taxonomic richness and gene expression over time). In contrast, the nitrogen metabolism sub-network exhibited a dominant taxon and a keystone ammonia oxidizing monooxygenase, the first enzyme of ammonia oxidation, which may lead to the production of nitrous oxide (a powerful greenhouse gas). 
Overall, the results presented in this thesis build a comprehensive repertoire of interactions in microbial communities, ranging from a simplistic consortium (10’s of populations) to a natural complex microbiome (100’s of populations). These were ultimately uncovered using an array of techniques, including unsupervised gene expression clustering, pathway analysis, reaction networks, co-expression networks, eigengenes and linearity trends between transcriptome and proteome. Moreover, we learnt that to achieve a full understanding of microbial ecology and detailed interactions, we need to integrate all the meta-omics layers quantified with absolute measurements. However, when scaling these approaches to real-world communities, the massive amounts of generated data bring new challenges and necessitate simplifying strategies to reduce complexity and extrapolate ecological trends.
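
The protein-to-RNA ratio range and the linearity (k) between transcriptome and proteome mentioned in the abstract above can be illustrated with a small sketch. The abstract does not define how k is computed, so the code below assumes, purely for illustration, that k is the least-squares slope of log10 protein abundance against log10 RNA abundance. The function names and the toy abundance values are hypothetical and are not data from the thesis.

```python
import math


def fit_loglog_line(rna, protein):
    """Least-squares fit of log10(protein) = slope * log10(rna) + intercept.

    Assumption for this sketch: the slope plays the role of the
    linearity k between transcript and protein abundances.
    """
    xs = [math.log10(v) for v in rna]
    ys = [math.log10(v) for v in protein]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def ratios_in_range(rna, protein, low, high):
    """Flag, per gene, whether the protein-to-RNA ratio lies in [low, high]."""
    return [low <= p / r <= high for r, p in zip(rna, protein)]


# Toy absolute abundances (molecules per cell, made up for illustration).
rna = [1.0, 10.0, 100.0]
protein = [100.0, 1000.0, 10000.0]
slope, intercept = fit_loglog_line(rna, protein)
within_bacterial_range = ratios_in_range(rna, protein, 1e2, 1e4)
```

With these toy values every gene has a protein-to-RNA ratio of 100, so the fitted slope is about 1 and the intercept about 2 on the log10 scale.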

    Transcriptomic Profiling Using Next Generation Sequencing - Advances, Advantages, and Challenges

    The transcriptome, the functional element of the genome, comprises different kinds of RNA molecules such as mRNA, miRNA, ncRNA, rRNA, and tRNA, to name a few. Each of these RNA molecules plays a vital role in the physiological response, and understanding their regulation is critical for a better understanding of the functional genome. RNA Sequencing (RNASeq) is one of the latest techniques applied to genome-wide transcriptome characterization and profiling using high-throughput sequencing data. Compared to array-based methods, RNASeq provides more in-depth and more precise information on transcriptome characterization and quantification. Depending on the availability of a reference genome, transcriptome assembly can be reference-guided or de novo. Once transcripts are assembled, downstream analyses such as expression profiling, gene ontology, and pathway enrichment can give more insight into gene regulation. This chapter describes the significance of RNASeq studies over traditional array-based methods, approaches to analyzing RNASeq data, available methods and tools, challenges associated with the data analysis, application areas, and some of the recent advances made in the area of transcriptome study and its application