    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, or even whether a clear versioning structure exists. We also include compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
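
To illustrate why run-length compression of differential lists can pay off on versioned collections, here is a minimal sketch. It is not the paper's implementation: the function names `gaps`, `run_length`, and `decode` are hypothetical, and the posting list is made up. The assumption is that a term present in many consecutive versions produces long runs of gap 1.

```python
def gaps(postings):
    """Differential encoding: each document id minus the previous one."""
    out = []
    prev = 0
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def run_length(values):
    """Collapse equal consecutive values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def decode(runs):
    """Invert run_length followed by gaps, recovering the posting list."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the term occurs in versions 3..7 and 20..22.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
runs = run_length(gaps(postings))
```

Eight postings collapse to four runs here; the saving grows with the length of the runs of consecutive documents.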

    Fully Online Grammar Compression in Constant Space

    We present novel variants of fully online LCA (FOLCA), a fully online grammar compression algorithm that builds a straight-line program (SLP) and directly encodes it into a succinct representation in an online manner. FOLCA enables a direct encoding of an SLP into a succinct representation that is asymptotically equivalent to an information-theoretic lower bound for representing an SLP (Maruyama et al., SPIRE'13). The compression of FOLCA takes time linear in the length of the input text, and its working space depends only on the size of the SLP, which enables us to apply FOLCA to large-scale repetitive texts. Recent repetitive texts, however, include some noise. For example, current sequencing technology has significant error rates, which embed noise into genome sequences. For such noisy repetitive texts, FOLCA, whose working space grows with the SLP size, consumes a large amount of memory. We present two variants of FOLCA that work in constant space by leveraging the idea behind stream mining techniques. Experiments using 100 human genomes, corresponding to about 300GB, from the 1000 human genomes project revealed the applicability of our method to large-scale, noisy repetitive texts. Comment: This is an extended version of a paper accepted to the Data Compression Conference (DCC), 201

    Finger Search in Grammar-Compressed Strings

    Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string, report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index $f$, called the \emph{finger}, and the query index $i$. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant, which also supports moving the finger in time that depends on the distance moved. Let $n$ be the size of the grammar, and let $N$ be the size of the string. For the static variant we give a linear space representation that supports placing the finger in $O(\log N)$ time and subsequently accessing in $O(\log D)$ time, where $D$ is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in $O(\log N)$ time and accessing and moving the finger in $O(\log D + \log \log N)$ time. Compared to the best linear space solution to random access, we improve an $O(\log N)$ query bound to $O(\log D)$ for the static variant and to $O(\log D + \log \log N)$ for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar-compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars.
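
As background for the random access problem, the following sketch shows the naive baseline only, not the finger search structure of the paper. It assumes a straight-line program stored as a hypothetical dictionary `rules`, where each nonterminal maps either to a single character or to a pair of symbols. Access descends from the start symbol using precomputed expansion lengths, so it takes time proportional to the height of the grammar rather than $O(\log N)$ or $O(\log D)$.

```python
def expansion_lengths(rules):
    """Length of the string generated by every symbol (memoized)."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):  # terminal rule: a single character
                lengths[sym] = 1
            else:                     # binary rule: (left, right)
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the expansion of `start`."""
    if not 0 <= i < lengths[start]:
        raise IndexError("position out of range")
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left             # target lies in the left child
        else:
            i -= lengths[left]     # skip the left child's expansion
            sym = right


# Hypothetical grammar generating "ababa".
rules = {"A": "a", "B": "b", "X": ("A", "B"), "Y": ("X", "X"), "Z": ("Y", "A")}
lengths = expansion_lengths(rules)
text = "".join(access(rules, lengths, "Z", i) for i in range(lengths["Z"]))
```

A finger search structure would avoid restarting from the root for every query near the finger; this sketch restarts each time.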

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows us passes and memory both polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. Comment: draft of PhD thesis

    GraCT: A Grammar based Compressed representation of Trajectories

    We present a compressed data structure to store free trajectories of moving objects (ships over the sea, for example) allowing spatio-temporal queries. Our method, GraCT, uses a $k^2$-tree to store the absolute positions of all objects at regular time intervals (snapshots), whereas the positions between snapshots are represented as logs of relative movements compressed with Re-Pair. Our experimental evaluation shows important savings in space and time with respect to a fair baseline. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
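
The Re-Pair step can be sketched as follows. This is a simple quadratic-time illustration, not GraCT's implementation: the representation of movements as `(dx, dy)` tuples, the rule names `R0`, `R1`, ..., and the functions `repair` and `expand` are all assumptions made for the example. The idea is to repeatedly replace the most frequent pair of adjacent symbols with a new nonterminal until no pair occurs twice.

```python
from collections import Counter


def repair(seq):
    """Naive Re-Pair: returns the reduced sequence and the grammar rules."""
    seq = list(seq)
    rules = {}
    while len(seq) > 1:
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, nothing left to gain
        name = "R%d" % len(rules)
        rules[name] = pair
        out = []
        i = 0
        while i < len(seq):
            # Replace non-overlapping occurrences, scanning left to right.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding nonterminals with an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if isinstance(sym, str) and sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return out


# Hypothetical log of relative movements: east, east, north, repeated.
moves = [(1, 0), (1, 0), (0, 1)] * 4
compressed, rules = repair(moves)
```

Regular movement patterns, such as a ship keeping a steady course, yield repeated pairs and therefore short compressed logs.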

    Universal Compressed Text Indexing

    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon} n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comment
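
The string attractor property can be checked by brute force on small inputs. The sketch below only illustrates the definition and is not part of the index: the function name `is_attractor` and the use of 0-based positions are assumptions. A set of positions is an attractor if every distinct substring has at least one occurrence that spans a position in the set.

```python
def is_attractor(text, positions):
    """Brute-force check (cubic time, for small texts only)."""
    positions = set(positions)
    n = len(text)
    all_substrings = set()
    covered = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            all_substrings.add(sub)
            # The occurrence text[i:j] is captured if it spans a position.
            if any(i <= p < j for p in positions):
                covered.add(sub)
    return covered == all_substrings


single_run = is_attractor("aaaa", {0})    # one position captures a, aa, aaa, aaaa
missing_char = is_attractor("ab", {0})    # "b" never spans position 0
```

Every attractor must contain at least one position for each distinct character, which is why `{0}` fails on `"ab"`, while a highly repetitive text such as `"aaaa"` admits an attractor of size 1.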

    The Unsupervised Acquisition of a Lexicon from Continuous Speech

    We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency. Comment: 27-page technical report
