32 research outputs found

    GAM-NGS: genomic assemblies merger for next generation sequencing

    Background: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance the contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through read alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of local problematic regions. Results: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS outputs an improved, reliable set of sequences. GAM-NGS is also very efficient, merging assemblies with substantially fewer computational resources than comparable tools. To achieve these goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools. Conclusions: The difficulty of obtaining correct and reliable assemblies with a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects.
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating that is most likely to be correct
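
The abstract above says that loci shared by two assemblies are found through read alignments and stored in a weighted graph. The following is only a minimal sketch of that general idea, not GAM-NGS's actual algorithm: it assumes each read maps to a single contig per assembly, treats whole contigs rather than sub-contig regions as candidate blocks, and uses the number of shared reads as the edge weight. The function name `build_block_graph` and the `min_shared` threshold are hypothetical.

```python
from collections import defaultdict


def build_block_graph(aln_a, aln_b, min_shared=2):
    """Link contigs of two assemblies that share aligned reads.

    aln_a, aln_b: dict mapping read_id -> contig_id for assembly A and B
                  (hypothetical input format, one alignment per read).
    min_shared:   hypothetical threshold; pairs supported by fewer reads
                  are dropped.
    Returns a dict mapping (contig_a, contig_b) -> number of shared reads,
    i.e. the weighted edges of a bipartite graph between the assemblies.
    """
    weights = defaultdict(int)
    for read_id, contig_a in aln_a.items():
        contig_b = aln_b.get(read_id)
        if contig_b is not None:
            # The same read aligns to both contigs, so they likely
            # cover the same genomic locus.
            weights[(contig_a, contig_b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


# Toy alignments: reads r1 and r2 support the pair (A1, B1).
aln_a = {"r1": "A1", "r2": "A1", "r3": "A2", "r4": "A2"}
aln_b = {"r1": "B1", "r2": "B1", "r3": "B1", "r5": "B2"}
graph = build_block_graph(aln_a, aln_b)
```

In this toy input only the pair (A1, B1) reaches the threshold; the single read shared by A2 and B1 is treated as too weak to define a block.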

    Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

    Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

    Alignment and reconciliation strategies for large-scale de novo assembly

    The theme of the thesis is sequencing (large) genomes and assembling them: an area at the intersection of algorithmics and technology. The birth of next-generation sequencing (NGS) and third-generation sequencing (TGS) platforms dropped the costs of genome analysis by orders of magnitude compared to the older (Sanger) method. These events also paved the way to a continuously increasing number of genome sequencing projects and the need to redesign several algorithms (as well as data structures) in order to cope with the computational challenges introduced by the latest technologies. In this dissertation we explore two major problems: de novo assembly and long-sequence alignment. The former has been tackled first with a global approach and then by taking advantage of a hierarchical scheme (more natural considering the type of dataset at our disposal). More precisely, we proposed a novel assembly reconciliation tool which proved to be competitive with state-of-the-art competitors and was the only one able to scale with large datasets. The second problem has been studied in order to extend and speed up a computationally critical phase of the first one. Specifically, it consists of aligning and merging pools of long assembled sequences, each one representing a small fraction of the genome and independently assembled from NGS data. We devised a hierarchical framework (HAM) and a fingerprint-based algorithm (DFP) for merging and detecting overlaps between long and accurate sequences. Also in this case, the tools we developed achieved results comparable to those of state-of-the-art software, while using considerably fewer computational resources.

    Metagenomic assembly of complex ecosystems with highly accurate long-reads

    Understanding the poorly characterized communities of soil and rhizosphere microbiota is crucial for plant growth and health. In line with this goal, the MISTIC project seeks to develop methodologies for modeling microbial community dynamics. My thesis contributes to this effort by specifically addressing the challenge of assembling genomes from these complex communities using highly accurate long reads.

    A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

    Background: Biochemical and regulatory pathways have until recently been thought of and modelled within one cell type, one organism and one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires the reconstruction of biochemical and regulatory pathways at the community level in a given environment. In order to understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, we want to evaluate the quantity of gene protein sequences or transcripts associated with a given pathway by precisely estimating the abundance of protein domains, their weak presence or absence in environmental samples. Results: MetaCLADE is a novel profile-based domain annotation pipeline based on a multi-source domain annotation strategy. It applies directly to reads and improves identification of the catalog of functions in microbiomes. MetaCLADE is applied to simulated data and to more than ten metagenomic and metatranscriptomic datasets from different environments, where it outperforms InterProScan in the number of annotated domains. It is compared to the state-of-the-art non-profile-based and profile-based methods, UProC and HMM-GRASPx, showing complementary predictions to UProC. A combination of MetaCLADE and UProC improves the functional annotation of environmental samples even further. Conclusions: Learning about the functional activity of environmental microbial communities is a crucial step in understanding microbial interactions and large-scale environmental impact. MetaCLADE has been explicitly designed for metagenomic and metatranscriptomic data and allows for the discovery of patterns in divergent sequences, thanks to its multi-source strategy.
MetaCLADE greatly improves current domain annotation methods and reaches a fine degree of accuracy in the annotation of very different environments such as soil and marine ecosystems, ancient metagenomes and human tissues.

    Targeted domain assembly for fast functional profiling of metagenomic datasets with S3A

    Motivation: Understanding the ever-increasing number of metagenomic sequences accumulating in our databases demands approaches that rapidly 'explore' the content of multiple and/or large metagenomic datasets with respect to specific domain targets, avoiding full domain annotation and full assembly. Results: S3A is a fast and accurate domain-targeted assembler designed for rapid functional profiling. It is based on a novel construction and a fast traversal of the Overlap-Layout-Consensus graph, designed to reconstruct coding regions from domain-annotated metagenomic sequence reads. S3A relies on high-quality domain annotation to efficiently assemble metagenomic sequences and on the design of a new confidence measure for a fast evaluation of overlapping reads. Its implementation is highly generic and can be applied to any arbitrary type of annotation. On simulated data, S3A achieves a level of accuracy similar to that of classical metagenomics assembly tools while allowing faster and sensitive profiling of domains of interest. When studying a few dozen functional domains (a typical scenario), S3A is up to an order of magnitude faster than general-purpose metagenomic assemblers, thus enabling the analysis of a larger number of datasets in the same amount of time. S3A opens new avenues to the fast exploration of the rapidly increasing number of metagenomic datasets of ever-increasing size. Availability and implementation: S3A is available at http://www.lcqb.upmc.fr/S3A_ASSEMBLER/

    Hierarchical Assembly of Pools

    This study introduces a method to address the problem of building a draft de novo assembly of complex genomes when a collection of well-assembled long-insert pools is available. Sequencing and assembling a collection of such pools reduces the complexity of the assembly and has proven to be a viable strategy for carrying out downstream analyses in recent sequencing projects. In this work we present a framework to tackle this problem: we propose a novel fingerprinting technique to speed up overlap detection and we describe a merging technique, based on the well-established string graph structure, to carry out the reconciliation step. Finally, we show some preliminary results on simulated data sets based on human chromosome 14, obtained with an early implementation of a tool we called Hierarchical Assemblies Merger.