On Sensitivity of Compact Directed Acyclic Word Graphs
Compact directed acyclic word graphs (CDAWGs) [Blumer et al. 1987] are a
fundamental data structure on strings with applications in text pattern
searching, data compression, and pattern discovery. Intuitively, the CDAWG of a
string is obtained by merging isomorphic subtrees of the suffix tree
[Weiner 1973] of the same string; thus CDAWGs are a compact indexing
structure. In this paper, we investigate the sensitivity of CDAWGs when a
single character edit operation (insertion, deletion, or substitution) is
performed at the left end of the input string; namely, we are interested in
the worst-case increase in the size of the CDAWG after a left-end edit
operation. We prove an upper bound, stated in terms of the number of edges of
the CDAWG of the original string, on the number of new edges added to the
CDAWG after a left-end edit operation. Further, we present almost matching
lower bounds on the sensitivity of CDAWGs for all cases of insertion,
deletion, and substitution.
Comment: This is a full version of the paper accepted for WORDS 202
The Rightmost Equal-Cost Position Problem
LZ77-based compression schemes compress the input text by replacing factors
of the text with an encoded reference to a previous occurrence, formed by the
pair (length, offset). For a given factor, the smaller the offset, the better
the resulting compression ratio. This is optimally achieved by using the
rightmost occurrence of the factor in the previous text. Given a cost
function, for instance the minimum number of bits used to represent an
integer, we define the Rightmost Equal-Cost Position (REP) problem as the
problem of finding an occurrence of a factor whose cost is equal to the cost
of the rightmost one. We present the Multi-Layer Suffix Tree, a data
structure that, for a text of length n, at any time i provides REP(LPF) in
constant time, where LPF is the longest previous factor, i.e. the greedy
phrase; a reference to the list REP({set of prefixes of LPF}) in constant
time; and REP(p) in O(|p| log log n) time for any given pattern p.
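As a hedged, naive baseline for the REP problem (nowhere near the Multi-Layer Suffix Tree's constant-time bounds), the sketch below uses the bit length of the offset as the cost function mentioned above and scans all previous occurrences; the names are illustrative, not from the paper.

```python
def bit_cost(x):
    # Example cost function: minimum number of bits to write x in binary.
    return max(1, x.bit_length())

def rightmost_equal_cost(text, i, factor):
    # Naive REP: among occurrences of factor starting before position i
    # (and fitting before i), return an occurrence whose offset cost
    # equals the cost of the rightmost occurrence's offset.
    m = len(factor)
    occs = [j for j in range(i)
            if j + m <= i and text[j:j + m] == factor]
    if not occs:
        return None
    target = bit_cost(i - occs[-1])  # cost of the rightmost occurrence
    for j in occs:
        if bit_cost(i - j) == target:
            return j
```

Note that the returned occurrence need not be the rightmost one, only equally cheap to encode: offsets 2 and 3 both cost two bits, for example.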
Sliding Window String Indexing in Streams
Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log (w/δ)). In particular, for a delay of δ = εw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
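For contrast with the O(log w)-per-character structure above, a naive baseline for the sliding-window index can be sketched as follows; it stores the window explicitly and scans it in O(w·|P|) time per query, with illustrative names chosen here, not taken from the paper.

```python
from collections import deque

class SlidingWindowIndex:
    # Naive baseline: keep the last w characters and scan on each query.
    # The paper's structure instead processes each character in O(log w)
    # time with high probability.
    def __init__(self, w):
        self.window = deque(maxlen=w)  # deque drops the oldest character

    def append(self, ch):
        # Stream in one character of S.
        self.window.append(ch)

    def occurrences(self, pattern):
        # Report all start positions of pattern in the current window.
        s = "".join(self.window)
        out, start = [], 0
        while True:
            j = s.find(pattern, start)
            if j == -1:
                return out
            out.append(j)
            start = j + 1  # allow overlapping occurrences
```
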
Maintaining the size of LZ77 on semi-dynamic strings
We consider the problem of maintaining the size of the LZ77 factorization of a string S of length at most n under the following operations: (a) appending a given letter to S and (b) deleting the first letter of S. Our main result is an algorithm for this problem with amortized update time Õ(√n). As a corollary, we obtain an Õ(n√n)-time algorithm for computing the most LZ77-compressible rotation of a length-n string; a naive approach for this problem would compute the LZ77 factorization of each possible rotation and would thus take quadratic time in the worst case. We also show an Ω(√n) lower bound for the additive sensitivity of LZ77 with respect to the rotation operation. Our algorithm employs dynamic trees to maintain the longest-previous-factor array information and depends on periodicity-based arguments that bound the number of the required updates and enable their efficient computation.
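The naive approach mentioned above can be sketched directly: compute the greedy, self-referential LZ77 factorization of each rotation and take the best. Since the factorization itself is computed naively here, this sketch is even slower than the quadratic baseline the abstract describes, and nothing like the authors' Õ(√n)-update algorithm; it only illustrates the problem statement.

```python
def lz77_size(s):
    # Number of phrases in the greedy, self-referential LZ77
    # factorization: each phrase is the longest prefix of the remaining
    # suffix that also occurs starting at an earlier position (overlaps
    # allowed), or a single fresh letter.
    i, phrases, n = 0, 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier start positions
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            best = max(best, l)
        i += max(best, 1)
        phrases += 1
    return phrases

def most_compressible_rotation(s):
    # Brute force over all rotations, as in the naive approach.
    return min(range(len(s)),
               key=lambda r: lz77_size(s[r:] + s[:r]))
```
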
Optimal Parsing for Dictionary Text Compression
Dictionary-based compression algorithms include a parsing strategy to
transform the input text into a sequence of dictionary phrases. Given a text,
such a process is usually not unique and, for compression purposes, it makes
sense to find one of the possible parsings that minimizes the final
compression ratio. This is the parsing problem. An optimal parsing is a
parsing strategy or a parsing algorithm that solves the parsing problem
taking into account all the constraints of a compression algorithm or of a
class of homogeneous compression algorithms. Compression algorithm
constraints are, for instance, the dictionary itself, i.e. the dynamic set of
available phrases, and how much a phrase weighs in the compressed text, i.e.
the number of bits composing the codeword that represents the phrase, also
denoted as the encoding cost of a dictionary pointer.
In more than 30 years of history of dictionary-based text compression, while
plenty of algorithms, variants and extensions have appeared, and while the
dictionary approach to text compression has become one of the most
appreciated and utilized in almost all storage and communication processes,
only a few optimal parsing algorithms have been presented. Many compression
algorithms still lack optimality of their parsing or, at least, a proof of
optimality. This happens because there is no general model of the parsing
problem that includes all the dictionary-based algorithms and because the
existing optimal parsing algorithms work under too restrictive hypotheses.
This work focuses on the parsing problem and presents both a general model
for dictionary-based text compression, called the Dictionary-Symbolwise Text
Compression theory, and a general parsing algorithm that is proved to be
optimal under some realistic hypotheses. This algorithm is called
Dictionary-Symbolwise Flexible Parsing, and it covers almost all of the known
cases of dictionary-based text compression algorithms, together with the
large class of their variants where the text is decomposed into a sequence of
symbols and dictionary phrases.
In this work we further consider the case of a free mixture of a dictionary
compressor and a symbolwise compressor. Our Dictionary-Symbolwise Flexible
Parsing also covers this case. Indeed, we have an optimal parsing algorithm
in the case of dictionary-symbolwise compression where the dictionary is
prefix-closed and the cost of encoding a dictionary pointer is variable. The
symbolwise compressor is any classical one that works in linear time, as
many common variable-length encoders do. Our algorithm works under the
assumption that a special graph, described in the following, is well
defined. Even if this condition is not satisfied, it is possible to use the
same method to obtain almost optimal parses. In detail, when the dictionary
is LZ78-like, we show how to implement our algorithm in linear time. When
the dictionary is LZ77-like, our algorithm can be implemented in
O(n log n) time. Both have O(n) space complexity.
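The graph-based idea behind such optimal parsing can be sketched as a shortest-path computation: vertices are text positions, each dictionary phrase matching at a position contributes an edge weighted by its encoding cost, and a symbolwise edge always advances one character. The sketch below is a generic illustration assuming explicitly given phrase and symbol cost functions; it is not the Dictionary-Symbolwise Flexible Parsing algorithm itself, and its names are ours.

```python
def optimal_parse(text, dictionary, phrase_cost, symbol_cost):
    # Vertex i = position in text. For each dictionary phrase d matching
    # at i, add edge i -> i + len(d) of weight phrase_cost(d); a
    # symbolwise edge i -> i + 1 of weight symbol_cost(text[i]) is always
    # available. The graph is a DAG ordered by position, so a single
    # left-to-right relaxation pass finds the shortest (cheapest) parse.
    n = len(text)
    dist = [float("inf")] * (n + 1)
    prev = [None] * (n + 1)
    dist[0] = 0
    for i in range(n):
        if dist[i] == float("inf"):
            continue
        # Symbolwise edge: emit one plain symbol.
        c = dist[i] + symbol_cost(text[i])
        if c < dist[i + 1]:
            dist[i + 1], prev[i + 1] = c, (i, text[i])
        # Dictionary edges: emit a phrase starting at i.
        for d in dictionary:
            if d and text.startswith(d, i):
                c = dist[i] + phrase_cost(d)
                j = i + len(d)
                if c < dist[j]:
                    dist[j], prev[j] = c, (i, d)
    # Walk predecessors back from the end to recover the parse.
    parse, j = [], n
    while j > 0:
        i, piece = prev[j]
        parse.append(piece)
        j = i
    return parse[::-1], dist[n]
```

With a prefix-closed dictionary and variable pointer costs, this formulation is where the "special graph" assumption enters: the parse is well defined exactly when every position of the text is reachable in the graph.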
Even if the main aim of this work is theoretical, some experimental results
are introduced to underline some practical effects of parsing optimality in
terms of compression performance and to show how to improve the compression
ratio by building Dictionary-Symbolwise extensions of known algorithms.
Finally, some more detailed experiments are presented in a dedicated
appendix.