
    Relative Lempel-Ziv Compression of Suffix Arrays

    We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, facilitating pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA ’18), still provides significant compression, and allows pattern location queries to be answered more than two orders of magnitude faster in practice.
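A minimal Python sketch of the two ingredients named in the abstract: differential encoding of the suffix array, and greedy RLZ parsing of the differenced array against a reference drawn from it. All names are invented for illustration, the quadratic match search is for clarity only, and the fixed-stride reference stands in for the paper's random sampling; this is a sketch of the general technique, not the authors' implementation.

```python
# Sketch: differential encoding + relative Lempel-Ziv (RLZ) parsing of a
# suffix array. Illustrative only; names and parameters are invented.

def suffix_array(text):
    # Naive construction (O(n^2 log n)); fine for a small demo.
    return sorted(range(len(text)), key=lambda i: text[i:])

def differential(sa):
    # Store each entry as its difference from the previous entry; on
    # repetitive texts the differenced array is itself highly repetitive.
    return [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]

def rlz_parse(stream, reference):
    # Greedy RLZ: encode `stream` as (ref_pos, length, literal) triples,
    # each copying the longest match available inside `reference`.
    phrases, i = [], 0
    while i < len(stream):
        best_pos, best_len = 0, 0
        for j in range(len(reference)):
            l = 0
            while (i + l < len(stream) and j + l < len(reference)
                   and reference[j + l] == stream[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        literal = stream[i + best_len] if i + best_len < len(stream) else None
        phrases.append((best_pos, best_len, literal))
        i += best_len + 1
    return phrases

text = "abracadabra_abracadabra"
diffs = differential(suffix_array(text))
reference = diffs[::4]   # fixed-stride stand-in for the paper's random sample
print(rlz_parse(diffs, reference))
```

Consistent with the abstract's claim, decompressing an arbitrary interval of the suffix array then amounts to replaying only the phrases that overlap that interval.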

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor even that a clear versioning structure exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. (This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094.)
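A small Python illustration of the contrast the abstract draws (the posting data and document layout are invented): plain gap encoding spends one code per gap, whereas in a versioned collection the differential list collapses into long runs that run-length coding captures; the paper's LZ and grammar variants exploit the same regularity more generally.

```python
# Sketch: run-length coding of a differential (d-gap) inverted list.
# The posting data is invented; real lists come from an indexed collection.

def gaps(postings):
    # Standard differential representation of a sorted posting list.
    return [postings[0]] + [postings[i] - postings[i - 1]
                            for i in range(1, len(postings))]

def run_length(seq):
    # Collapse maximal runs of equal values into (value, count) pairs.
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [tuple(r) for r in runs]

# Ten near-identical versions of a document, stored 100 ids apart, all
# containing the term: the gap list degenerates into one long run.
postings = [base + 7 for base in range(0, 1000, 100)]
print(gaps(postings))              # [7, 100, 100, 100, 100, 100, 100, 100, 100, 100]
print(run_length(gaps(postings)))  # [(7, 1), (100, 9)]
```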

    LZ-End Parsing in Linear Time


    Real-time and distributed applications for dictionary-based data compression

    The greedy approach to dictionary-based static text compression can be executed by a finite state machine. When it is applied in parallel to different blocks of data independently, it remains robust even on standard large-scale distributed systems with input files of arbitrary size. Beyond standard large scale, however, the very small size of the data blocks degrades compression effectiveness. This paper presents a robust approach for extreme distributed systems that fixes this problem by overlapping adjacent blocks and preprocessing the neighborhoods of the boundaries. Moreover, we introduce the notion of a pseudo-prefix dictionary, which allows optimal compression by means of a real-time semi-greedy procedure and a slight improvement in the compression ratio obtained by the distributed implementations.
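A minimal Python sketch of the baseline the abstract starts from: greedy longest-match parsing with a static dictionary, run independently on each block. The dictionary and block size are toy values, and the sketch deliberately exhibits the boundary cut that the paper's overlapping of adjacent blocks and boundary preprocessing is meant to repair; that repair itself is not implemented here.

```python
# Sketch: greedy static-dictionary compression applied independently per
# block, as a single worker of a distributed compressor would do. The toy
# dictionary and block size are invented; the paper's boundary-overlap
# preprocessing is not reproduced here.

DICT = {"a", "ab", "abc", "b", "bc", "c"}    # static dictionary
MAXLEN = max(len(w) for w in DICT)

def greedy_parse(block):
    # Longest-match greedy parsing; realizable as a finite state machine.
    out, i = [], 0
    while i < len(block):
        for l in range(min(MAXLEN, len(block) - i), 0, -1):
            if block[i:i + l] in DICT:
                out.append(block[i:i + l])
                i += l
                break
        else:
            raise ValueError("dictionary must cover every single symbol")
    return out

def compress_blocks(text, block_size=8):
    # Each block can be handled by an independent worker. Very small blocks
    # cut phrases at the boundaries, which is the effect the paper's
    # block-overlapping scheme is designed to remove.
    return [greedy_parse(text[i:i + block_size])
            for i in range(0, len(text), block_size)]

print(compress_blocks("abcabcabcabcabc"))
# [['abc', 'abc', 'ab'], ['c', 'abc', 'abc']] -- 6 phrases; parsing the
# whole string at once gives 5, so the boundary cut costs one phrase.
```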

    Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit

    The problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we are primarily concerned with sequences as our objects and with various string comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of solutions whose time complexity depends both on the chosen parsing method and on the string-comparison metric used for the alignment.

    Analysis and Applications of the T-complexity

    T-codes are variable-length self-synchronizing codes introduced by Titchener in 1984. T-code codewords are constructed recursively from a finite alphabet using an algorithm called T-augmentation, resulting in excellent self-synchronization properties. An algorithm called T-decomposition parses a given sequence into a series of T-prefixes, and finds a T-code set in which the sequence is encoded as a longest codeword. There are similarities and differences between T-decomposition and the conventional LZ78 incremental parsing. The LZ78 incremental parsing algorithm parses a given sequence sequentially into consecutive distinct subsequences (words), such that each word consists of the longest matching word parsed previously plus a literal symbol; the LZ-complexity is then defined as the number of words. By contrast, T-decomposition parses a given sequence into a series of T-prefixes, each of which consists of the recursive concatenation of the longest matching T-prefix parsed previously and a literal symbol, and it has to access the whole sequence every time it determines a T-prefix. Like the LZ-complexity, the T-complexity of a sequence is defined as the number of T-prefixes; however, the T-complexity of a particular sequence generally tends to be smaller than its LZ-complexity.

    In the first part of the thesis, we present our contributions to the theory of T-codes. In order to realize a sequential determination of T-prefixes, we devise a new T-decomposition algorithm using forward parsing. Both the T-complexity profile obtained from the forward T-decomposition and the LZ-complexity profile can be derived in a unified way using a differential equation method, which focuses on the increase of the average codeword length of a code tree. The obtained formulas are confirmed to coincide with those of previous studies. The magnitude of the T-complexity of a given sequence generally indicates its degree of randomness; however, there exist interesting sequences that have much larger T-complexities than any random sequence. We investigate the maximum T-complexity sequences and the maximum LZ-complexity sequences using various techniques, including those of the test suite released by the National Institute of Standards and Technology (NIST) of the U.S. government, and find that the maximum T-complexity sequences are less random than the maximum LZ-complexity sequences.

    In the second part of the thesis, we present our achievements in terms of applications, considering two of them: data compression and randomness testing. First, we propose a new data compression scheme based on T-codes using a dictionary method such that all phrases added to the dictionary have a recursive structure similar to T-codes. Our scheme compresses any of the files in the Calgary Corpus more efficiently than previous schemes based on T-codes and than UNIX compress, a variant of LZ78 (LZW). Next, we introduce a randomness test based on the T-complexity. Recently, the Lempel-Ziv (LZ) randomness test based on the LZ-complexity was officially excluded from the NIST test suite, because the distribution of P-values for random sequences of length 10^6, the most commonly used length, is strictly discrete in the case of the LZ-complexity. Our test solves this problem because the T-complexity features an almost ideal uniform continuous distribution of P-values for random sequences of length 10^6. The proposed test outperforms the NIST LZ test, a modified LZ test proposed by Doganaksoy and Göloglu, and all other tests included in the NIST test suite in detecting undesirable pseudorandom sequences generated by a multiplicative congruential generator (MCG), as well as non-random byte sequences Y = Y_0, Y_1, Y_2, ..., where Y_{3i} and Y_{3i+1} are random but Y_{3i+2} = (Y_{3i} + Y_{3i+1}) mod 2^8.
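For the LZ78 side of the comparison, here is a short self-contained Python sketch of incremental parsing and the resulting LZ-complexity (the number of words). It follows the rule stated in the abstract: each word is the longest previously parsed word extended by one literal symbol. The recursive T-prefix construction of T-decomposition is more involved and is not reproduced here.

```python
# Sketch: LZ78 incremental parsing and the LZ-complexity (number of words).

def lz78_complexity(s):
    # Parse s into distinct words, each equal to the longest previously
    # seen word plus one literal symbol; the LZ-complexity is the number
    # of words produced.
    dictionary = {"": 0}          # word -> phrase index
    words, current = [], ""
    for symbol in s:
        if current + symbol in dictionary:
            current += symbol     # still extending a previous word
        else:
            words.append(current + symbol)
            dictionary[current + symbol] = len(dictionary)
            current = ""
    if current:                   # a final, possibly repeated, word
        words.append(current)
    return len(words), words

n, words = lz78_complexity("aababcabcdabcde")
print(n, words)   # 5 ['a', 'ab', 'abc', 'abcd', 'abcde']
```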