989 research outputs found

    Multiple-Genome Annotation of Genome Fragments Using Hidden Markov Model Profiles

    To learn more about microbes and overcome the limitations of standard culture-based methods, microbial communities are being studied in an uncultured state. In such metagenomic studies, genetic material is sampled from the environment and sequenced using the whole-genome shotgun sequencing technique. This results in thousands of DNA fragments that need to be identified, so that the composition and inner workings of the microbial community can begin to be understood. Those fragments are then assembled into longer portions of sequences. However, the high diversity present in an environment and the often low level of genome coverage achieved by the sequencing technology result in a low number of assembled fragments (contigs) and many unassembled fragments (singletons). The identification of contigs and singletons is usually done using BLAST, which finds sequences similar to the contigs and singletons in a database. An expert may then manually read these results and decide whether the function and taxonomic origin of each fragment can be determined. In this report, an automated system called Anacle is developed to annotate, following a taxonomy, the unassembled fragments before the assembly process. Knowledge of what proteins can be found in each taxon is built into Anacle by clustering all known proteins of that taxon. The annotation performances from using Markov clustering (MCL) and Self-Organizing Maps (SOM) are investigated and compared. The resulting protein clusters can each be represented by a Hidden Markov Model (HMM) profile. Thus a "skeleton" of the taxon is generated, with the profile HMMs providing a summary of the taxon's genetic content. The experiments show that (1) MCL is superior to SOMs in annotation and in running time performance, (2) Anacle achieves good performance in taxonomic annotation, and (3) Anacle has the ability to generalize, since it can correctly annotate fragments from genomes not present in the training dataset. These results indicate that Anacle can be very useful to metagenomics projects.
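The clustering step named in this abstract, Markov clustering (MCL) of a taxon's proteins, can be illustrated with a small pure-Python sketch. It is not Anacle's implementation: the function names, the `inflation` and `iterations` defaults, and the assumption that a symmetric matrix of pairwise protein similarity scores is already available are all hypothetical choices made for illustration.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain square-matrix product, adequate for toy-sized inputs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Cluster items from a symmetric similarity matrix with basic MCL.

    `similarity[i][j]` is a non-negative score between proteins i and j
    (hypothetical input; the abstract does not say how scores are derived).
    """
    n = len(similarity)
    # Self-loops keep every column non-empty and stabilise the iteration.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)                                   # expansion
        m = normalize_columns([[v ** inflation for v in row]
                               for row in m])                # inflation
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two unrelated protein families: {0, 1, 2} and {3, 4}.
    toy = [
        [0, 1, 1, 0, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(markov_cluster(toy))
```

In a pipeline like the one described, each resulting cluster would then be aligned and summarised as a profile HMM by a separate tool; that step is outside this sketch. Production MCL implementations also prune small entries and test for convergence rather than running a fixed number of iterations.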

    Taxonomic classification of metagenomic sequences

    Gerlach W. Taxonomic classification of metagenomic sequences. Bielefeld: Universität; 2012. Bacteria, archaea and microeukaryotes can be found in almost every habitat present in nature, in particular in soil, sediments and sea water. They typically live in complex communities with different kinds of symbiotic associations, which include relationships with larger organisms like animals or plants. Examples are microbial communities in the gut or on the skin of animals and humans, or bacteria that live in symbiosis with plants. The vast majority of such microbes are unculturable and thus cannot be sequenced by means of traditional methods. The emerging discipline of metagenomics provides various in vivo and in silico tools to overcome this limitation. In particular, high-throughput sequencing techniques like 454 or Solexa-Illumina make it possible to explore those microbes by studying whole natural microbial communities and analysing their biological diversity as well as the underlying metabolic pathways. A current limitation of these technologies is that they can sequence only DNA fragments of a limited length. With this limitation it is usually not possible to recover complete microbial genomes. In addition, the DNA fragments are drawn randomly from the microbial communities and the exact species of origin is unknown. Over the past few years, different methods have been developed for the taxonomic and functional characterization of metagenomic shotgun sequences. However, the taxonomic classification of metagenomic sequences from novel species without close homologues in the biological sequence databases poses a challenge due to the high number of wrong taxonomic predictions on lower taxonomic ranks. In this thesis we present CARMA3, a novel method for the taxonomic classification of assembled and unassembled metagenomic sequences that has been adapted to work with both BLAST and HMMER3 homology searches. CARMA3 accepts protein-encoding DNA sequences, protein sequences, and 16S-rDNA sequences as input. In addition, we present WebCARMA, a web application for the analysis of protein-encoding DNA sequences with CARMA3 without the need for a local installation. We evaluate our novel method in different experiments using simulated and real shotgun metagenomes and show that CARMA3 makes fewer wrong taxonomic predictions (at the same sensitivity) than other BLAST-based methods. In the last experiment we show that even very short reads can, in principle, be used to describe the taxonomic content of a metagenome.

    Minimum information about an uncultivated virus genome (MIUViG)

    This is the final version. Available on open access from Nature Research via the DOI in this record. NOTE: the full list of funders and grants is in the acknowledgements section of the article. We present an extension of the Minimum Information about any (x) Sequence (MIxS) standard for reporting sequences of uncultivated virus genomes. Minimum Information about an Uncultivated Virus Genome (MIUViG) standards were developed within the Genomic Standards Consortium framework and include virus origin, genome quality, genome annotation, taxonomic classification, biogeographic distribution and in silico host prediction. Community-wide adoption of MIUViG standards, which complement the Minimum Information about a Single Amplified Genome (MISAG) and Metagenome-Assembled Genome (MIMAG) standards for uncultivated bacteria and archaea, will improve the reporting of uncultivated virus genomes in public databases. In turn, this should enable more robust comparative studies and a systematic exploration of the global virosphere. Funders: Simons Foundation International; Natural Environment Research Council (NERC).

    Taxonomic distribution of large DNA viruses in the sea

    Phylogenetic mapping of metagenomics data reveals the taxonomic distribution of large DNA viruses in the sea, including giant viruses of the Mimiviridae family.

    A Primer on Metagenomics

    Metagenomics is a discipline that enables the genomic study of uncultured microorganisms. Faster, cheaper sequencing technologies and the ability to sequence uncultured microbes sampled directly from their habitats are expanding and transforming our view of the microbial world. Distilling meaningful information from the millions of new genomic sequences presents a serious challenge to bioinformaticians. In cultured microbes, the genomic data come from a single clone, making sequence assembly and annotation tractable. In metagenomics, the data come from heterogeneous microbial communities, sometimes containing more than 10,000 species, with the sequence data being noisy and partial. From sampling, to assembly, to gene calling and function prediction, bioinformatics faces new demands in interpreting voluminous, noisy, and often partial sequence data. Although metagenomics is a relative newcomer to science, the past few years have seen an explosion in computational methods applied to metagenomics-based research. It is therefore not within the scope of this article to provide an exhaustive review. Rather, we provide here a concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and review the recent progress made. We also note whether there is software that implements any of the methods presented here, and briefly review its utility. Nevertheless, it would be useful if readers of this article would avail themselves of the comment section provided by this journal, and relate their own experiences. Finally, the last section of this article provides a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics.

    Bioinformatics for High-throughput Virus Detection and Discovery

    Pathogen detection is a challenging problem because any specimen may contain one or more of many different microbes. Additionally, a specimen may contain microbes that have yet to be discovered. Traditional diagnostics are ill-equipped to address these challenges because they are focused on the detection of a single agent or panel of agents. I have developed three innovative computational approaches for analyzing high-throughput genomic assays capable of detecting many microbes in a parallel and unbiased fashion. The first is a metagenomic sequence analysis pipeline that was initially applied to 12 pediatric diarrhea specimens in order to give the first ever look at the diarrhea virome. Metagenomic sequencing and subsequent analysis revealed a spectrum of viruses in these specimens, including known and highly divergent viruses. This metagenomic survey serves as a basis for future investigations about the possible role of these viruses in disease. The second tool I developed is a novel algorithm for diagnostic microarray analysis called VIPR (Viral Identification with a PRobabilistic algorithm). The main advantage of VIPR relative to other published methods for diagnostic microarray analysis is that it relies on a training set of empirical hybridizations of known viruses to guide future predictions. VIPR uses a Bayesian statistical framework in order to accomplish this. A set of hemorrhagic fever viruses and their relatives were hybridized to a total of 110 microarrays in order to test the performance of VIPR. VIPR achieved an accuracy of 94% and outperformed existing approaches for this dataset. The third tool I developed for pathogen detection is called VIPR HMM. VIPR HMM expands upon VIPR's previous implementation by incorporating a hidden Markov model (HMM) in order to detect recombinant viruses. VIPR HMM correctly identified 95% of inter-species breakpoints for a set of recombinant alphaviruses and flaviviruses. Mass sequencing and diagnostic microarrays require robust computational tools in order to make predictions regarding the presence of microbes in specimens of interest. High-throughput diagnostic assays coupled with powerful analysis tools have the potential to increase the efficacy with which we detect pathogens and treat disease as these technologies play more prominent roles in clinical laboratories.
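As a rough illustration of how a hidden Markov model can locate a recombination breakpoint, the sketch below runs the standard Viterbi algorithm over a toy sequence. It is not VIPR HMM: that tool works on microarray hybridization data, whereas this example assumes (purely for illustration) that each position has already been reduced to a symbol naming the parental virus it resembles most. All names and probabilities here are hypothetical.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (log-space Viterbi)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointer = {}, {}
        for s in states:
            # Best predecessor state for being in state s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best] + math.log(trans_p[best][s])
                          + math.log(emit_p[s][obs]))
            pointer[s] = best
        scores.append(current)
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def breakpoints(path):
    """Positions at which the inferred parental state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


if __name__ == "__main__":
    states = ("A", "B")  # hypothetical parental viruses
    start_p = {"A": 0.5, "B": 0.5}
    # Switching parents is rare, so isolated noisy positions are tolerated.
    trans_p = {"A": {"A": 0.95, "B": 0.05}, "B": {"A": 0.05, "B": 0.95}}
    # Each state usually emits its own symbol, with 10% noise.
    emit_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
    path = viterbi("AAAAAABBBBBB", states, start_p, trans_p, emit_p)
    print("".join(path), breakpoints(path))
```

The low switching probability is the design choice that matters: it makes the model prefer long runs from one parent, so a breakpoint is reported only where the evidence for a change is sustained.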

    PLoS One

    The evolutionary classification of influenza genes into lineages is a first step in understanding their molecular epidemiology and can inform the subsequent implementation of control measures. We introduce a novel approach called Lineage Assignment By Extended Learning (LABEL) to rapidly determine cladistic information for any number of genes without the need for time-consuming sequence alignment, phylogenetic tree construction, or manual annotation. Instead, LABEL relies on hidden Markov model profiles and support vector machine training to hierarchically classify gene sequences by their similarity to pre-defined lineages. We assessed LABEL by analyzing the annotated hemagglutinin genes of highly pathogenic (H5N1) and low pathogenicity (H9N2) avian influenza A viruses. Using the WHO/FAO/OIE H5N1 evolution working group nomenclature, the LABEL pipeline quickly and accurately identified the H5 lineages of uncharacterized sequences. Moreover, we developed an updated clade nomenclature for the H9 hemagglutinin gene and show a similarly fast and reliable phylogenetic assessment with LABEL. While this study was focused on hemagglutinin sequences, LABEL could be applied to the analysis of any gene and shows great potential to guide molecular epidemiology activities, accelerate database annotation, and provide a data sorting tool for other large-scale bioinformatic studies.

    K-mer based methods for the identification of bacteria and plasmids (K-meeridel põhinevad meetodid bakterite ja plasmiidide tuvastamiseks)

    The electronic version of the dissertation does not include the publications. Microbes have roamed Earth for billions of years and can be found almost anywhere. They are present even on our skin and in our gut. However, some bacteria are pathogenic and cause disease. For instance, the Black Death, which killed millions during the Middle Ages, was caused by the bacterium Yersinia pestis. Nowadays, antibiotics protect us against the bacterial threat, but a new problem is looming: widespread antibiotic resistance. This is partly facilitated by plasmids, DNA sequences that are separate from the bacterial chromosome and can be readily passed from one bacterium to another. The general goal of this work was to develop methods for identifying bacteria and plasmids from raw data produced by sequencing centers. We decided to use k-mer based analysis for this task. A k-mer is simply a short stretch of DNA with a length of k nucleotides. A long DNA sequence, such as a bacterial genome, can be divided into shorter k-mers and analyzed as a set of k-mers. This has the advantage of not being limited by read length: every read contains k-mers, and by analyzing these we can identify the contents of the sample. StrainSeeker is a bacterial identification program developed by our group. We developed a novel algorithm that predicts the location of an isolated bacterium on a user-provided phylogenetic tree. We also created a web server with a visual interface for users with limited bioinformatics experience. For plasmid detection, we assumed that the plasmid copy number is usually higher than that of the bacterial chromosome, so the average frequency of plasmid k-mers should also be higher than the frequency of chromosomal k-mers. We named the program PlasmidSeeker and tested it with real and simulated bacterial whole-genome sequencing samples in which the real plasmid content was known. PlasmidSeeker detected all plasmids and accurately estimated their copy numbers. With our work, we have made a contribution to the field of computational microbiology and provided novel means for the analysis of bacterial samples.
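The k-mer reasoning in this abstract (split sequences into k-mers, then compare how often plasmid and chromosomal k-mers occur in the reads) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PlasmidSeeker's algorithm: the function names are hypothetical, reverse complements and sequencing errors are ignored, and the copy number is taken as a simple ratio of mean k-mer frequencies.

```python
from collections import Counter
from statistics import mean


def kmers(sequence, k):
    """Yield every overlapping k-mer of `sequence`."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_read_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def mean_kmer_frequency(reference, read_counts, k):
    """Average frequency, in the reads, of the k-mers of a reference."""
    return mean(read_counts[kmer] for kmer in kmers(reference, k))


def estimate_copy_number(plasmid, chromosome, reads, k):
    """Plasmid copy number as the ratio of mean k-mer frequencies."""
    counts = count_read_kmers(reads, k)
    chromosome_frequency = mean_kmer_frequency(chromosome, counts, k)
    if chromosome_frequency == 0:
        raise ValueError("no chromosomal k-mers were found in the reads")
    return mean_kmer_frequency(plasmid, counts, k) / chromosome_frequency


if __name__ == "__main__":
    # Toy references that share no 4-mers (made-up sequences).
    chromosome = "ACGTTGCAAG"
    plasmid = "GGATCCTTAA"
    # Simulated sample: chromosome read twice, plasmid read six times.
    reads = [chromosome] * 2 + [plasmid] * 6
    print(estimate_copy_number(plasmid, chromosome, reads, k=4))
```

With the toy sample above, every chromosomal 4-mer is seen twice and every plasmid 4-mer six times, so the estimate is 3.0. Because only k-mer counts are used, the same code works for reads of any length, which is the advantage the abstract points out.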