6 research outputs found

    On Sensitivity of Compact Directed Acyclic Word Graphs

    Compact directed acyclic word graphs (CDAWGs) [Blumer et al. 1987] are a fundamental data structure on strings with applications in text pattern searching, data compression, and pattern discovery. Intuitively, the CDAWG of a string T is obtained by merging isomorphic subtrees of the suffix tree [Weiner 1973] of the same string T; thus CDAWGs are a compact indexing structure. In this paper, we investigate the sensitivity of CDAWGs when a single character edit operation (insertion, deletion, or substitution) is performed at the left end of the input string T; namely, we are interested in the worst-case increase in the size of the CDAWG after a left-end edit operation. We prove that if e is the number of edges of the CDAWG for string T, then the number of new edges added to the CDAWG after a left-end edit operation on T is less than e. Further, we present almost matching lower bounds on the sensitivity of CDAWGs for all cases of insertion, deletion, and substitution. Comment: This is a full version of the paper accepted for WORDS 202

    The Rightmost Equal-Cost Position Problem

    LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence, formed by the pair (length, offset). For a given factor, the smaller the offset, the smaller the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for a text of length n, at any time i provides REP(LPF) in constant time, where LPF is the longest previous factor, i.e. the greedy phrase; a reference to the list of REP({set of prefixes of LPF}) in constant time; and REP(p) in time O(|p| log log n) for any given pattern p
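
To make the REP definition above concrete, here is a minimal brute-force sketch. It is not the Multi-Layer Suffix Tree from the abstract: it simply scans all earlier positions. The function names (`bit_cost`, `previous_occurrences`, `rep_candidates`) are hypothetical, and the cost function is assumed to be the bit length of the offset, which is the example cost the abstract mentions.

```python
def bit_cost(offset: int) -> int:
    """Assumed cost function: minimum number of bits needed to write the offset."""
    return offset.bit_length()


def previous_occurrences(text: str, i: int, length: int) -> list[int]:
    """Start positions j < i where the factor text[i:i+length] also occurs.

    Occurrences may overlap position i, as LZ77 references usually allow.
    """
    factor = text[i:i + length]
    return [j for j in range(i) if text.startswith(factor, j)]


def rep_candidates(text: str, i: int, length: int) -> list[int]:
    """All previous occurrences whose offset costs the same as the rightmost one.

    Any element of the returned list is a valid answer to the REP problem
    for this factor under the assumed cost function.
    """
    occ = previous_occurrences(text, i, length)
    if not occ:
        return []
    rightmost_cost = bit_cost(i - max(occ))  # rightmost occurrence = smallest offset
    return [j for j in occ if bit_cost(i - j) == rightmost_cost]
```

For instance, in `"xxabcx"` the factor `"x"` at position 5 has previous occurrences at positions 0 and 1, with offsets 5 and 4; both need 3 bits, so either position is an acceptable answer.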

    Sliding Window String Indexing in Streams

    Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log (w/δ)). In particular, for a delay of δ = εw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees
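
The problem statement above can be illustrated with a naive baseline. This sketch is not the hierarchical suffix-tree structure from the abstract and does not achieve its time bounds: it keeps the last w characters in a buffer and answers each query by brute force. The class name `NaiveWindowIndex` and its methods are hypothetical, and for simplicity the pattern is passed whole rather than streamed.

```python
from collections import deque


class NaiveWindowIndex:
    """Brute-force sliding-window index: O(w) space, O(w * |P|) time per query."""

    def __init__(self, w: int) -> None:
        self.window = deque(maxlen=w)  # oldest characters fall out automatically
        self.seen = 0                  # total number of characters read so far

    def push(self, ch: str) -> None:
        """Consume one character of the stream S."""
        self.window.append(ch)
        self.seen += 1

    def occurrences(self, pattern: str) -> list[int]:
        """Absolute stream positions where `pattern` occurs inside the window."""
        if not pattern:
            return []
        text = "".join(self.window)
        start = self.seen - len(text)  # stream position of the window's first character
        last = len(text) - len(pattern)
        return [start + k for k in range(last + 1) if text.startswith(pattern, k)]
```

After pushing the seven characters of `"abcabca"` into an index with w = 4, the window holds stream positions 3 to 6, so a query for `"a"` reports positions 3 and 6 only.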

    Sliding Window String Indexing in Streams

    Given a string S over an alphabet Σ, the 'string indexing problem' is to preprocess S to subsequently support efficient pattern matching queries, i.e., given a pattern string P report all the occurrences of P in S. In this paper we study the 'streaming sliding window string indexing problem'. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the 'window', for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and the pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = εw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees

    Maintaining the size of LZ77 on semi-dynamic strings

    We consider the problem of maintaining the size of the LZ77 factorization of a string S of length at most n under the following operations: (a) appending a given letter to S and (b) deleting the first letter of S. Our main result is an algorithm for this problem with amortized update time Õ(√n). As a corollary, we obtain an Õ(n√n)-time algorithm for computing the most LZ77-compressible rotation of a length-n string - a naive approach for this problem would compute the LZ77 factorization of each possible rotation and would thus take quadratic time in the worst case. We also show an Ω(√n) lower bound for the additive sensitivity of LZ77 with respect to the rotation operation. Our algorithm employs dynamic trees to maintain the longest-previous-factor array information and depends on periodicity-based arguments that bound the number of the required updates and enable their efficient computation
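
The naive approach mentioned in the abstract above (factorize every rotation independently) can be sketched as follows. This is only the baseline, not the Õ(n√n) algorithm, and it is even slower than the quadratic bound quoted, because the factorization here is a plain brute-force scan. The sketch assumes the greedy, self-referencing LZ77 variant in which each factor is either a fresh letter or the longest prefix of the remaining text that also starts at an earlier position; other variants give different counts. Both function names are hypothetical.

```python
def lz77_size(s: str) -> int:
    """Number of factors in the assumed greedy, self-referencing LZ77 factorization."""
    n = len(s)
    i = 0
    factors = 0
    while i < n:
        longest = 0
        # Try every earlier start j and extend the match as far as possible;
        # the match may run past position i (self-reference).
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a letter with no earlier occurrence is its own factor
        factors += 1
    return factors


def most_compressible_rotation(s: str) -> tuple[int, int]:
    """Return (factor count, shift k) minimizing lz77_size over rotations s[k:] + s[:k].

    Ties are broken toward the smallest shift. An empty string yields (0, 0).
    """
    if not s:
        return (0, 0)
    return min((lz77_size(s[k:] + s[:k]), k) for k in range(len(s)))
```

Under this variant, `"aaba"` has 4 factors, while its rotation by 2, `"baaa"`, has only 3.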

    Optimal Parsing for Dictionary Text Compression

    Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, such a process is usually not unique and, for compression purposes, it makes sense to find one of the possible parsings that minimizes the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy or a parsing algorithm that solves the parsing problem taking into account all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Compression algorithm constraints are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weighs on the compressed text, i.e. the number of bits of which the codeword representing such a phrase is composed, also denoted as the encoding cost of a dictionary pointer. In more than 30 years of history of dictionary-based text compression, while plenty of algorithms, variants and extensions appeared and while the dictionary approach to text compression became one of the most appreciated and utilized in almost all storage and communication processes, only a few optimal parsing algorithms were presented. Many compression algorithms still lack optimality of their parsing or, at least, a proof of optimality. This happens because there is no general model of the parsing problem that includes all the dictionary-based algorithms and because the existing optimal parsing algorithms work under too restrictive hypotheses. This work focuses on the parsing problem and presents both a general model for dictionary-based text compression called Dictionary-Symbolwise Text Compression theory and a general parsing algorithm that is proved to be optimal under some realistic hypotheses. This algorithm is called Dictionary-Symbolwise Flexible Parsing and it covers almost all of the known cases of dictionary-based text compression algorithms together with the large class of their variants where the text is decomposed into a sequence of symbols and dictionary phrases. In this work we further consider the case of a free mixture of a dictionary compressor and a symbolwise compressor. Our Dictionary-Symbolwise Flexible Parsing also covers this case. We have indeed an optimal parsing algorithm in the case of dictionary-symbolwise compression where the dictionary is prefix closed and the cost of encoding a dictionary pointer is variable. The symbolwise compressor is any classical one that works in linear time, as many common variable-length encoders do. Our algorithm works under the assumption that a special graph, which will be described in the following, is well defined. Even if this condition is not satisfied, it is possible to use the same method to obtain almost optimal parses. In detail, when the dictionary is LZ78-like, we show how to implement our algorithm in linear time. When the dictionary is LZ77-like our algorithm can be implemented in time O(n log n). Both have O(n) space complexity. Even if the main aim of this work is of theoretical nature, some experimental results will be introduced to underline some practical effects of the parsing optimality in terms of compression performance and to show how to improve the compression ratio by building Dictionary-Symbolwise extensions of known algorithms. Finally, some more detailed experiments are hosted in a devoted appendix
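
The parsing problem described above can be pictured as a shortest-path problem on text positions. The sketch below is not the Dictionary-Symbolwise Flexible Parsing algorithm and says nothing about the special graph the abstract refers to; it assumes a small static dictionary with a fixed cost per phrase and a single flat cost per symbolwise-encoded character, all of which are simplifications. The function name `optimal_parse` and its parameters are hypothetical.

```python
def optimal_parse(
    text: str, phrase_cost: dict[str, int], symbol_cost: int
) -> tuple[int, list[tuple[str, str]]]:
    """Minimum-cost parse of `text` into dictionary phrases and single symbols.

    Nodes are positions 0..len(text). An edge i -> j exists if text[i:j] is a
    dictionary phrase (weight phrase_cost[text[i:j]]) or if j == i + 1
    (symbolwise edge, weight symbol_cost). All edges point forward, so one
    left-to-right pass computes shortest paths.
    """
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i] = cheapest encoding of text[:i]
    back = [None] * (n + 1)          # back[j] = (i, kind) of the edge used to reach j
    best[0] = 0
    max_len = max(map(len, phrase_cost), default=0)

    for i in range(n):
        # Symbolwise edge: encode one character on its own.
        if best[i] + symbol_cost < best[i + 1]:
            best[i + 1] = best[i] + symbol_cost
            back[i + 1] = (i, "symbol")
        # Dictionary edges: every phrase that matches at position i.
        for length in range(1, min(max_len, n - i) + 1):
            piece = text[i:i + length]
            if piece in phrase_cost and best[i] + phrase_cost[piece] < best[i + length]:
                best[i + length] = best[i] + phrase_cost[piece]
                back[i + length] = (i, "phrase")

    # Walk the back pointers from the end of the text to recover the parse.
    parse = []
    j = n
    while j > 0:
        i, kind = back[j]
        parse.append((kind, text[i:j]))
        j = i
    parse.reverse()
    return best[n], parse
```

With `phrase_cost = {"ab": 3, "abab": 10}` and `symbol_cost = 2`, the text `"abab"` is cheapest as two `"ab"` phrases (cost 6), beating both four single symbols (cost 8) and the single long phrase (cost 10). This shows why the longest available phrase is not always the best choice when pointer costs vary.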