31 research outputs found

    Using cascading Bloom filters to improve the memory usage for de Bruijn graphs

    De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3], which represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory than the method of [3], with insignificant impact on construction time. At the same time, our experiments showed better query times than [3]. This is, to our knowledge, the best practical representation of de Bruijn graphs.
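
    For intuition, the following minimal Python sketch shows the query-side logic of such a cascade, assuming a simple double-hashing Bloom filter and alternating levels ending in a small exact set; it is an illustration, not the authors' implementation, and the sizing of each level from the expected number of critical false positives is omitted.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter using Kirsch-Mitzenmacher double hashing."""
    def __init__(self, n_bits, n_hashes):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

class CascadingBloomFilter:
    """levels[0] holds the true k-mers; levels[1] the critical false
    positives of levels[0]; levels[2] the true k-mers still passing
    levels[1]; and so on, down to a small exact set, so that queries
    from the indexed neighborhood are answered exactly."""
    def __init__(self, levels, exact_tail):
        self.levels = levels          # list of BloomFilter, built elsewhere
        self.exact_tail = exact_tail  # plain set terminating the cascade

    def __contains__(self, kmer):
        for i, bf in enumerate(self.levels):
            if kmer not in bf:
                # Missing from an even level (true-k-mer side): absent.
                # Missing from an odd level (false-positive side): present.
                return i % 2 == 1
        in_tail = kmer in self.exact_tail
        # The exact tail plays the role of the next level in the alternation.
        return in_tail if len(self.levels) % 2 == 0 else not in_tail
```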

    Recovering complete and draft population genomes from metagenome datasets

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
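
    As a toy illustration of the coverage-covariation signal described above (not any published binner, which would typically also use sequence composition), the sketch below clusters contigs by their normalized per-sample coverage profiles; the k-means details are illustrative assumptions.

```python
import numpy as np

def bin_contigs_by_coverage(coverage, n_bins, n_iter=100, seed=0):
    """coverage: (n_contigs, n_samples) matrix of mean per-sample depth.
    Returns one bin label per contig from k-means on log-scaled,
    row-normalized profiles, so absolute abundance is factored out."""
    rng = np.random.default_rng(seed)
    x = np.log1p(coverage.astype(float))
    x /= np.linalg.norm(x, axis=1, keepdims=True) + 1e-9
    centers = x[rng.choice(len(x), n_bins, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for b in range(n_bins):
            if np.any(labels == b):
                centers[b] = x[labels == b].mean(axis=0)
    return labels

# Toy usage: contigs 0-2 covary across four samples (one genome),
# contigs 3-4 follow a different profile (another genome).
cov = np.array([[10, 50, 5, 20], [12, 60, 6, 24], [9, 45, 5, 18],
                [40, 2, 30, 1], [44, 3, 33, 1]])
print(bin_contigs_by_coverage(cov, n_bins=2))  # typically [0 0 0 1 1], up to relabeling
```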

    DIDA: Distributed Indexing Dispatched Alignment

    One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is sequence alignment, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When there are many queries and/or the targets are large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for fast access while searching for matches. When the target is static, such as an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed for large-scale alignments against draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency (BCCA) license and is free for academic use.
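
    The partition/index/dispatch pattern can be sketched as follows; this is a schematic single-process Python illustration, not DIDA's actual code, and the seed length and round-robin partitioning are assumptions.

```python
from collections import defaultdict

K = 11  # seed length; an illustrative parameter

def seeds(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_partition_indexes(targets, n_parts):
    """Round-robin targets into n_parts partitions; each partition keeps
    its own seed -> [(target_id, offset)] index (one index per node)."""
    indexes = [defaultdict(list) for _ in range(n_parts)]
    for tid, t in enumerate(targets):
        idx = indexes[tid % n_parts]
        for i in range(len(t) - K + 1):
            idx[t[i:i + K]].append((tid, i))
    return indexes

def dispatch_and_align(query, indexes):
    """Send the query only to partitions sharing at least one seed with it,
    and report seed hits (a real aligner would extend these into alignments)."""
    hits = []
    qseeds = seeds(query)
    for part, idx in enumerate(indexes):
        if qseeds & idx.keys():  # the dispatch filter
            for s in qseeds:
                hits += [(part, tid, off) for tid, off in idx.get(s, [])]
    return hits

targets = ["ACGTACGTACGTTTTGGG", "TTTTCCCCGGGGAAAACGT"]
indexes = build_partition_indexes(targets, n_parts=2)
print(dispatch_and_align("ACGTACGTACGT", indexes))  # hits in partition 0 only
```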

    EXFI: Exon and splice graph prediction without a reference genome

    For population genetic studies in nonmodel organisms, it is important to use every available source of genomic information. This paper presents EXFI, a Python pipeline that predicts the splice graph and exon sequences using an assembled transcriptome and raw whole-genome sequencing reads. The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, predicts the intron-exon boundaries, calls exons from the assembly, and generates the underlying splice graph. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts.
    Funding: Basque Government, predoctoral grant PRE_2017_2_0169 and grant IT558-1.
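
    One way to picture the boundary-prediction step: transcript k-mers that span an exon-exon junction are absent from the raw genomic reads, so maximal runs of genome-supported k-mers delimit exons. In the sketch below, a plain Python set stands in for the Bloom filter, and the splitting and GFA1 emission rules are illustrative assumptions rather than EXFI's exact algorithm.

```python
K = 5  # toy k-mer size

def predict_exons(transcript, genome_kmers, k=K):
    """Return (start, end) spans of maximal runs of transcript k-mers
    found in genome_kmers, extended to full sequence coordinates."""
    present = [transcript[i:i + k] in genome_kmers
               for i in range(len(transcript) - k + 1)]
    exons, start = [], None
    for i, p in enumerate(present + [False]):  # sentinel closes the last run
        if p and start is None:
            start = i
        elif not p and start is not None:
            exons.append((start, i - 1 + k))  # last supported k-mer spans k bases
            start = None
    return exons

def to_gfa1(transcript_id, transcript, exons):
    """Emit GFA1-style S (segment) and L (link) records for the exons."""
    lines = []
    for n, (s, e) in enumerate(exons):
        lines.append(f"S\t{transcript_id}.e{n}\t{transcript[s:e]}")
    for n in range(len(exons) - 1):
        lines.append(f"L\t{transcript_id}.e{n}\t+\t{transcript_id}.e{n+1}\t+\t0M")
    return "\n".join(lines)

# Toy usage: the genome covers both exons but not the junction k-mers.
genome = "AAAAACCCCC" + "NNNN" + "GGGGGTTTTT"   # intron shown as N's
genome_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}
tx = "AAAAACCCCC" + "GGGGGTTTTT"                # spliced transcript
exons = predict_exons(tx, genome_kmers)
print(exons)                  # [(0, 10), (10, 20)]
print(to_gfa1("tx1", tx, exons))
```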

    Speeding up NGS software development

    The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix arrays, the Burrows-Wheeler transform, Bloom filters, etc.). Mappers, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible. The overall efficiency of NGS software comes from a smart combination of how data are represented in computer memory and how they are processed by the available processing units inside a processor. Developing such software is thus a real challenge, as it requires a large spectrum of competences, from high-level data structure and algorithm concepts to tiny implementation details. We have developed a C++ library, called GATB (Genomic Assembly and Analysis Tool Box), to speed up the design of NGS algorithms. This library offers a panel of high-level optimized building blocks. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computers, small servers) with a few GB of memory. Hence, from a high-level C++ API, NGS software designers can rapidly elaborate their own software based on state-of-the-art algorithms and data structures of the domain. To demonstrate the efficiency of the GATB library, several NGS tools have been designed, such as a contig assembler (Minia), a read corrector (Bloocoo) and an SNP discovery tool (DiscoSNP). The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.
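
    As a toy, language-shifted illustration of the building-block style described above (GATB itself is a C++ API; nothing below is GATB code), the sketch splits reads across worker threads, counts k-mers per chunk, and merges the counts, i.e., the node multiset of a de Bruijn graph.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy k-mer size

def count_chunk(reads, k=K):
    """Count all k-mers in one chunk of reads."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def parallel_kmer_counts(reads, n_threads=4):
    """Split reads round-robin, count per chunk in a thread pool, merge.
    (CPython's GIL limits real speedup here; GATB gets its parallelism
    from native C++ threads -- this only mirrors the programming model.)"""
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total

reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
print(parallel_kmer_counts(reads).most_common(3))
```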

    Efficient Reconciliation of Genomic Datasets of High Similarity

    We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, the go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and are independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
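
    A compact sketch of the scheme: sample closed syncmers from each dataset, insert them into one IBLT each, subtract the tables cell-wise, and peel the symmetric difference out; with d = |A Δ B| recovered and the (cheaply tracked) set sizes, Jaccard is (|A|+|B|-d)/(|A|+|B|+d). All parameters below (k, s, table size, hash choices) are illustrative assumptions, and the code is plain Python, not the paper's implementation.

```python
import hashlib

def h(x, salt):
    d = hashlib.blake2b(f"{salt}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def closed_syncmers(seq, k=15, s=5):
    """Keep k-mers whose smallest s-mer is at the first or last offset;
    the decision depends only on the k-mer itself (context-independent)."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        m = min(smers)
        if smers[0] == m or smers[-1] == m:
            out.add(kmer)
    return out

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    v = 1  # leading 1 preserves length information
    for b in kmer:
        v = v * 4 + ENC[b]
    return v

class IBLT:
    """Each cell keeps (count, xor of keys, xor of checksums); after
    subtracting B's table from A's, only keys in the symmetric
    difference survive and can be peeled from count == +/-1 cells."""
    def __init__(self, m=128, r=3):
        self.m, self.r = m, r
        self.count, self.keysum, self.chksum = [0] * m, [0] * m, [0] * m

    def _cells(self, key):
        return {h(key, i) % self.m for i in range(self.r)}

    def insert(self, key, sign=1):
        for c in self._cells(key):
            self.count[c] += sign
            self.keysum[c] ^= key
            self.chksum[c] ^= h(key, "chk")

    def subtract(self, other):
        for c in range(self.m):
            self.count[c] -= other.count[c]
            self.keysum[c] ^= other.keysum[c]
            self.chksum[c] ^= other.chksum[c]

    def peel(self):
        """After self = A - B, recover (keys only in A, keys only in B)."""
        a_only, b_only, progress = set(), set(), True
        while progress:
            progress = False
            for c in range(self.m):
                if self.count[c] in (1, -1) and self.chksum[c] == h(self.keysum[c], "chk"):
                    key, sign = self.keysum[c], self.count[c]
                    (a_only if sign == 1 else b_only).add(key)
                    self.insert(key, -sign)  # remove it; may free other cells
                    progress = True
        return a_only, b_only

# Usage: two near-identical sequences; table size tracks the difference.
s1 = "ACGGTTACGTAGCATTACGGATTTGCAGCATGGA" * 2
s2 = s1[:-6] + "TTACGG"
A, B = closed_syncmers(s1), closed_syncmers(s2)
ta, tb = IBLT(), IBLT()
for x in A: ta.insert(encode(x))
for x in B: tb.insert(encode(x))
ta.subtract(tb)
a_only, b_only = ta.peel()
d = len(a_only) + len(b_only)  # |A xor B|, recovered from the IBLT alone
jac = (len(A) + len(B) - d) / (len(A) + len(B) + d)  # set sizes are cheap counters
print(f"sampled syncmers: {len(A)} vs {len(B)}, difference {d}, Jaccard ~ {jac:.3f}")
```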