15 research outputs found

    Space-efficient merging of succinct de Bruijn graphs

    Full text link
    We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bionformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost of the state of the art tool for the same problem presented by Muggli et al. [bioRxiv 2017, Bioinformatics 2019], but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds.Comment: Accepted to SPIRE'1

    Space efficient merging of de Bruijn graphs and Wheeler graphs

    Full text link
    The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost of the state of the art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that Wheeler graphs merging is in general a much more difficult problem, and we provide a space efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions.Comment: 24 pages, 10 figures. arXiv admin note: text overlap with arXiv:1902.0288

    ALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCE

    Get PDF
    High-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assem- bly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade also brings with it some immense computational challenges. One of these is how do we design memory-efficient structures for indexing and querying this data. This challenge is not limited to only raw sequencing data (i.e. reads) but also to the growing collection of reference sequences (genomes, and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices, “Pufferfish”, an index over a set of genomes or tran- scriptomes, and “Rainbowfish” and “Mantis” which are both indices for indexing a set of raw sequencing data sets. All of these indices are designed with consideration of support for high query performance and memory efficient construction and query. The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner. We have developed both sparse and dense indexing schemes which allow trading index space for query speed (though queries always remain asymptotically optimal). Pufferfish provides a full reference index that can return the set of refer- ences, positions and orientations of any k-mer (substring of fixed length “k” ) in the input genomes. We have built an alignment tool, Puffaligner, around this index for aligning sequencing reads to reference sequences. We demonstrate that Puffaligner is able to produce highly-sensitive alignments, similar to those of Bowtie2, but much more quickly and exhibits speed similar to the ultrafast STAR aligner while requiring considerably less memory to construct its index and align reads. The Rainbowfish and Mantis data structures, on the other hand, are based on reference-free colored de Bruijn graphs (cdbg) constructed over raw sequencing data. Rainbowfish introduces a new efficient representation of the color information which is then adopted and refined by Mantis. Mantis supports graph traversal and other topological analyses, but is also particularly well-suited for large-scale sequence-level search over thousands of samples. We develop multiple and successively-refined versions of the Mantis index, culminating in an index that adopts a minimizer- partitioned representation of the underlying k-mer set and a referential encoding of the color information that exploits fast near-neighbor search and efficient encoding via a minimum spanning tree. We describe, further, how this index can be made incrementally updatable by developing an efficient merge algorithm and storing the overall index in a multi-level log-structured merge (LSM) tree. We demonstrate the utility of this index by building a searchable Mantis, via recursive merging, over 10,000 raw sequencing samples, which we then scale to over 15,000 samples via incremental update. This index can be queried, on a commodity server, to discover the samples likely containing thousands of reference sequences in only a few minutes

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    Get PDF
    LIPIcs, Volume 248, ISAAC 2022, Complete Volum

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

    Get PDF

    Combinatorics

    Get PDF
    Combinatorics is a fundamental mathematical discipline that focuses on the study of discrete objects and their properties. The present workshop featured research in such diverse areas as Extremal, Probabilistic and Algebraic Combinatorics, Graph Theory, Discrete Geometry, Combinatorial Optimization, Theory of Computation and Statistical Mechanics. It provided current accounts of exciting developments and challenges in these fields and a stimulating venue for a variety of fruitful interactions. This is a report on the meeting, containing extended abstracts of the presentations and a summary of the problem session

    Towards Democratizing the Fabrication of Electrochromic Displays

    Get PDF
    corecore