1,419 research outputs found

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Full text link
    Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is rr, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r)O(r) space and was able to efficiently count the number of occurrences of a pattern of length mm in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of rr. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occocc occurrences efficiently within O(r)O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ)O(m+occ) within O(rlog(n/r))O(r\log(n/r)) space, on a RAM machine of w=Ω(logn)w=\Omega(\log n) bits. Within O(rlog(n/r))O(r\log (n/r)) space, our index can also count in optimal time O(m)O(m). Raising the space to O(rwlogσ(n/r))O(r w\log_\sigma(n/r)), we support count and locate in O(mlog(σ)/w)O(m\log(\sigma)/w) and O(mlog(σ)/w+occ)O(m\log(\sigma)/w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(rlog(n/r))O(r\log(n/r)) space that replaces the text and extracts any text substring of length \ell in almost-optimal time O(log(n/r)+log(σ)/w)O(\log(n/r)+\ell\log(\sigma)/w). (...continues...

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Get PDF
    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n over an alphabet of size {\sigma} on a RAM machine with words of w = {\Omega}(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)

    Client side decompression technique provides faster DNA sequence data delivery

    Get PDF
    DNA sequences are generally very long chains of sequentially linked nucleotides. There are four different nucleotides and combinations of these build the nucleotide information of sequence files contained in data sources. When a user searches for any sequence for an organism, a compressed sequence file can be sent from the data source to the user. The compressed file then can be decompressed at the client end resulting in reduced transmission time over the Internet. A compression algorithm that provides a moderately high compression rate with minimal decompression time is proposed in this paper. We also compare a number of different compression techniques for achieving efficient delivery methods from an intelligent genomic search agent over the Interne

    Faster repetition-aware compressed suffix trees based on Block Trees

    Get PDF
    The suffix tree is a fundamental data structure in stringology, but its space usage, though linear, is an important problem in applications like Bioinformatics. We design and implement a new compressed suffix tree (CST) targeted to highly repetitive texts, such as large genomic collections of the same species. Our first contribution is to enhance the Block Tree, a data structure that captures the repetitiveness of its input sequence, to represent the topology of trees with large repeated subtrees. Our so-called Block-Tree Compressed Topology (BT-CT) data structure augments the Block Tree nodes with data that speeds up tree navigation. Our Block-Tree CST (BT-CST), in turn, uses the BT-CT to compress the topology of the suffix tree, and also replaces the sampling of the suffix array and its inverse with grammar-and/or Block-Tree-based representations of those arrays. Our experimental results show that BT-CTs reach navigation speeds comparable to compact tree representations that are insensitive to repetitiveness, while using 2-10 times less space on the topologies of the suffix trees of repetitive collections. Our BT-CST is slightly larger than previous repetition-aware suffix trees based on grammar-compressed topologies, but outperforms them in time, often by orders of magnitude.(c) 2021 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).Peer reviewe

    Block trees

    Get PDF
    Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings. (C) 2020 Elsevier Inc. All rights reserved.Peer reviewe
    corecore