3,401 research outputs found

    Online Self-Indexed Grammar Compression

    Full text link
    Although several grammar-based self-indexes have been proposed thus far, their applicability is limited to offline settings where whole input texts are prepared, thus requiring to rebuild index structures for given additional inputs, which is often the case in the big data era. In this paper, we present the first online self-indexed grammar compression named OESP-index that can gradually build the index structure by reading input characters one-by-one. Such a property is another advantage which enables saving a working space for construction, because we do not need to store input texts in memory. We experimentally test OESP-index on the ability to build index structures and search query texts, and we show OESP-index's efficiency, especially space-efficiency for building index structures.Comment: To appear in the Proceedings of the 22nd edition of the International Symposium on String Processing and Information Retrieval (SPIRE2015

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment

    siEDM: an efficient string index and search algorithm for edit distance with moves

    Full text link
    Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM's efficiency.Comment: 23 page

    Universal Indexes for Highly Repetitive Document Collections

    Get PDF
    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Computing LZ77 in Run-Compressed Space

    Get PDF
    In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n} can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing

    Improved ESP-index: a practical self-index for highly repetitive texts

    Full text link
    While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real world repetitive texts remains a challenge. ESP-index is a grammar-based self-index on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds of parsing discrepancies between different appearances of the same subtexts in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious issue on binary searches for finding appearances of variables for a query text, which resulted in slowing down the query searches. We present an improved ESP-index (ESP-index-I) by leveraging the idea behind succinct data structures for large alphabets. While ESP-index-I keeps the same types of efficiencies as ESP-index about the top-down searches, it avoid the binary searches using fast rank/select operations. We experimentally test ESP-index-I on the ability to search query texts and extract subtexts from real world repetitive texts on a large-scale, and we show that ESP-index-I performs better that other possible approaches.Comment: This is the full version of a proceeding accepted to the 11th International Symposium on Experimental Algorithms (SEA2014

    Rank, select and access in grammar-compressed strings

    Full text link
    Given a string SS of length NN on a fixed alphabet of σ\sigma symbols, a grammar compressor produces a context-free grammar GG of size nn that generates SS and only SS. In this paper we describe data structures to support the following operations on a grammar-compressed string: \mbox{rank}_c(S,i) (return the number of occurrences of symbol cc before position ii in SS); \mbox{select}_c(S,i) (return the position of the iith occurrence of cc in SS); and \mbox{access}(S,i,j) (return substring S[i,j]S[i,j]). For rank and select we describe data structures of size O(nσlogN)O(n\sigma\log N) bits that support the two operations in O(logN)O(\log N) time. We propose another structure that uses O(nσlog(N/n)(logN)1+ϵ)O(n\sigma\log (N/n)(\log N)^{1+\epsilon}) bits and that supports the two queries in O(logN/loglogN)O(\log N/\log\log N), where ϵ>0\epsilon>0 is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires O(nlogN)O(n\log N) bits of space and O(logN+m/logσN)O(\log N+m/\log_\sigma N) time to extract m=ji+1m=j-i+1 consecutive symbols from SS. Alternatively, we can achieve O(logN/loglogN+m/logσN)O(\log N/\log\log N+m/\log_\sigma N) query time using O(nlog(N/n)(logN)1+ϵ)O(n\log (N/n)(\log N)^{1+\epsilon}) bits of space. This matches a lower bound stated by Verbin and Yu for strings where NN is polynomially related to nn.Comment: 16 page
    corecore