
    GAM: Genomic Assemblies Merger

    Motivations. In the last 3 years more than 20 assemblers have been proposed to tackle the hard task of assembly. Recent evaluation efforts (Assemblathon 1 and GAGE) demonstrated that none of these tools clearly outperforms the others. However, the results show that some assemblers perform better than others on specific regions and statistics while performing poorly on other regions and evaluation measures. With this picture in mind we developed GAM (Genomic Assemblies Merger), whose primary goal is to merge two or more assemblies in order to obtain a more contiguous one. Moreover, as a by-product of the merging step, GAM is able to correct mis-assemblies. GAM does not need a global alignment between contigs, which makes it unique among Assembly Reconciliation tools. In this way a computationally expensive alignment is avoided, and paralogous sequences (likely to create false connections among contigs) do not represent a problem. The GAM procedure is based only on the information coming from the reads used in the assembly phases, and it can be used even on assemblies obtained with different datasets. Methods. Let us concentrate on the merging of two assemblies, dubbed M and S. As a preprocessing step, itself an almost mandatory analysis, the reads (or a subset of them) used in the assembly phase are aligned against M and S using a SAM-compatible aligner (e.g., BWA, rNA). GAM takes as input M, S and the two SAM files produced in the preprocessing step. The main idea is to identify fragments belonging to M and S that have high similarity. For this purpose, GAM identifies regions, named blocks, belonging to M and S that share a high enough number of reads (i.e., regions sharing the same aligned reads). After all blocks are identified, the Assembly Graph (AG) is built: each node corresponds to a block, and a directed edge connects block A to block B if the first precedes the second in either M or S (see Fig. 1). Once AG is available, the merging phase can start.
    As a first step GAM identifies genomic regions in which the assemblies contradict each other (loops, bifurcations, etc.). These areas represent potential inconsistencies between the two sequences. We chose to be as conservative as possible, electing (for example) M to be the Master assembly: all its contigs are assumed to be correct and cannot be contradicted. S becomes the Slave, and wherever an inconsistency is found, M is preferred to S. After the identification and resolution of problematic regions, GAM visits the simplified graph, merges contigs according to the blocks and edges in AG (each merging step is performed using a Smith-Waterman algorithm variant) and finally outputs the new, improved assembly. GAM is not limited to contigs; it can also work with scaffolds, filling the N's inserted by one assembler but not by the other. Results. GAM has been tested on several real datasets, in particular on Olea's chloroplast (241X Illumina paired reads and 21X 454 paired reads), Populus trichocarpa (82X Illumina paired reads), and Boa constrictor (40X Illumina paired reads). Illumina reads have an average length of 100 bp and an insert size of 500 bp. All tests were performed on a computer equipped with 8 cores and 32 GB RAM. ABySS and CLC were selected as assemblers. Results are summarized in Fig. 1. Olea's chloroplast was used as a proof-of-concept experiment. The presence of a reference sequence allowed validation of GAM's output (using dnadiff). Two assemblies were obtained with CLC using Illumina and 454 data, and GAM was used to merge them. Figure 1 shows that the GAM assembly is not only more contiguous but also more correct: while Master (CLC-Illumina) and Slave (CLC-454) have 58 and 39 suspicious regions respectively, GAM has only 14. On Populus trichocarpa and Boa constrictor, CLC assemblies were used as master due to their better contiguity. In both cases the assemblies returned by GAM were more contiguous (see Fig. 1).
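The Assembly Graph construction and the search for contradictions described in this abstract can be illustrated with a small sketch. This is not GAM's implementation: the `Block` fields, the function names, and the reading of "precedes" as "immediately precedes on the same contig" are all assumptions made for illustration, and contig orientation is ignored.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    """A pair of regions in M and S sharing aligned reads (hypothetical fields)."""
    block_id: int
    m_contig: str
    m_start: int
    s_contig: str
    s_start: int


def build_assembly_graph(blocks):
    """Return {block_id: set of successor ids}.

    Assumption: an edge A -> B is added when A immediately precedes B
    on the same contig of M or on the same contig of S.
    """
    graph = {b.block_id: set() for b in blocks}
    accessors = (
        (lambda b: b.m_contig, lambda b: b.m_start),  # order along M
        (lambda b: b.s_contig, lambda b: b.s_start),  # order along S
    )
    for contig_of, start_of in accessors:
        by_contig = defaultdict(list)
        for b in blocks:
            by_contig[contig_of(b)].append(b)
        for members in by_contig.values():
            members.sort(key=start_of)
            for a, b in zip(members, members[1:]):
                graph[a.block_id].add(b.block_id)
    return graph


def bifurcations(graph):
    """Nodes with more than one successor: M and S disagree on what follows."""
    return sorted(n for n, succ in graph.items() if len(succ) > 1)


def has_cycle(graph):
    """Depth-first search for a loop (recursive; fine for small toy graphs)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Toy data: M and S agree on 1 -> 2, then M continues to 3 while S continues to 4.
blocks = [
    Block(1, "m1", 0, "s1", 0),
    Block(2, "m1", 1000, "s1", 1200),
    Block(3, "m1", 2000, "s2", 0),
    Block(4, "m2", 0, "s1", 2500),
]
graph = build_assembly_graph(blocks)
print(bifurcations(graph))  # [2]
print(has_cycle(graph))     # False

# Two blocks placed in opposite order by M and S form a loop.
loop_graph = build_assembly_graph([
    Block(1, "m1", 0, "s1", 500),
    Block(2, "m1", 1000, "s1", 0),
])
print(has_cycle(loop_graph))  # True
```

In this toy setting, a conservative resolution in the spirit of the abstract would keep the Master edge 2 -> 3 at the bifurcation and discard the Slave edge 2 -> 4.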

    Genotype-Phenotype Correlation in a Family with Brugada Syndrome Harboring the Novel p.Gln371* Nonsense Variant in the SCN5A Gene

    Brugada syndrome (BrS) is marked by coved ST-segment elevation and increased risk of sudden cardiac death. The genetics of this syndrome are elusive in over half of the cases. Variants in the SCN5A gene are the single most common known genetic unifier, accounting for about a third of cases. Research models, such as animal models and cell lines, are limited. In the present study, we report the novel NM_198056.2:c.1111C>T (p.Gln371*) heterozygous variant in the SCN5A gene, as well as its segregation with BrS in a large family. The results herein suggest a pathogenic effect of this variant. Functional studies are certainly warranted to characterize the molecular effects of this variant.

    Recovering complete and draft population genomes from metagenome datasets

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite their computational cost when applied to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
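The idea of binning contigs by covarying coverage across multiple datasets can be sketched as follows. This is a toy illustration only, not the method of any tool discussed in the review: the greedy grouping, the `threshold` value, and the coverage numbers are assumptions, and real binners typically combine coverage with sequence composition.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no covariation signal
    return cov / (norm_x * norm_y)


def bin_by_coverage(coverage, threshold=0.95):
    """Greedily group contigs whose profiles correlate with a bin's first member.

    `coverage` maps contig name -> list of mean coverages, one per sample.
    """
    bins = []
    for name, profile in coverage.items():
        for members in bins:
            if pearson(profile, coverage[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Hypothetical mean coverage of four contigs in four samples.
coverage = {
    "contig1": [10, 40, 5, 20],
    "contig2": [12, 38, 6, 22],
    "contig3": [50, 5, 30, 8],
    "contig4": [48, 6, 33, 7],
}
bins = bin_by_coverage(coverage)
print(bins)  # [['contig1', 'contig2'], ['contig3', 'contig4']]
```

Contigs from the same genome are expected to rise and fall in abundance together across samples, which is why more samples generally give a sharper signal at a higher computational cost.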

    Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

    Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

    Strainberry: automated strain separation in low-complexity metagenomes using long reads

    High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances in long-read-based methods have enabled accurate assembly of bacterial genomes from complex microbiomes and offer an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and that relies solely on long-read data. We benchmarked Strainberry on mock communities, for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry to real datasets, for which it improved assemblies, generating 20-118% more genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements on strain-resolved metagenome assembly in environments of higher complexities.