240 research outputs found
Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly repetitive massive data sets such as genomes and web data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near-optimal bounds for these problems. Plugging in our new results, we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
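To make the definition concrete, here is a minimal static sketch of a relative compression and of random access into it. It is not the data structure from the paper: the function names, the greedy longest-match parse, and the quadratic search are all illustrative assumptions, and the sketch assumes every character of S occurs somewhere in R.

```python
from bisect import bisect_right
from itertools import accumulate

def relative_compress(R, S):
    """Greedily encode S as (start, length) references into R (illustrative)."""
    phrases = []
    i = 0
    while i < len(S):
        best_start, best_len = -1, 0
        # Naive longest-match search; a practical scheme would index R instead.
        for start in range(len(R)):
            length = 0
            while (start + length < len(R) and i + length < len(S)
                   and R[start + length] == S[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            raise ValueError(f"character {S[i]!r} does not occur in R")
        phrases.append((best_start, best_len))
        i += best_len
    return phrases

def phrase_ends(phrases):
    """Prefix sums of phrase lengths: ends[k] is where phrase k stops in S."""
    return list(accumulate(length for _, length in phrases))

def access(R, phrases, ends, pos):
    """Return S[pos] (0-indexed) without decompressing S."""
    k = bisect_right(ends, pos)           # first phrase ending after pos
    start, length = phrases[k]
    offset = pos - (ends[k] - length)     # position of pos inside phrase k
    return R[start + offset]
```

The prefix sums in `phrase_ends` are static here. Once S is edited, phrase lengths change and these sums must be updated, which is where the dynamic partial sums problem mentioned in the abstract comes in.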
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding of Burrows-Wheeler transformed strings, resulting in the
Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in $O(n \lg r)$ time and $O(r \lg n)$ bits of space, where
$n$ is the length of the input string $S$ received so far and $r$ is the number
of runs in the BWT of the reversed $S$. We improve on the state-of-the-art
algorithm for online RLBWT in terms of empirical construction time. By adopting
a dynamic list for maintaining a total order, we can replace rank queries in a
dynamic wavelet tree on a run-length compressed string with direct comparisons
of labels in the dynamic list. Empirical results on various benchmarks show the
efficiency of our algorithm, especially for highly repetitive strings.
Comment: In Proc. IWOCA201
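As a small illustration of what an RLBWT is, the sketch below builds the BWT by sorting all rotations and then run-length encodes it. This is a naive offline construction, not the online algorithm of the paper; the function names and the `$` sentinel (assumed absent from the input and smaller than every input character) are assumptions made for the example.

```python
from itertools import groupby

def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, offline)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def run_length_encode(s):
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

def rlbwt(text):
    """Run-length encoded BWT of text."""
    return run_length_encode(bwt(text))
```

The number of pairs returned by `rlbwt` is the number of runs in the BWT, which tends to be much smaller than the text length on highly repetitive inputs.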
Random Access in Persistent Strings and Segment Selection
We consider compact representations of collections of similar strings that
support random access queries. The collection of strings is given by a rooted
tree where edges are labeled by an edit operation (inserting, deleting, or
replacing a character) and a node represents the string obtained by applying
the sequence of edit operations on the path from the root to the node. The goal
is to compactly represent the entire collection while supporting fast random
access to any part of a string in the collection. This problem captures natural
scenarios such as representing the past history of an edited document or
representing highly repetitive collections. Given a tree with $n$ nodes, we
show how to represent the corresponding collection in $O(n)$ space and
$O(\log n / \log \log n)$ query time. This improves the previous time-space trade-offs
for the problem. Additionally, we show a lower bound proving that the query
time is optimal for any solution using near-linear space.
To achieve our bounds for random access in persistent strings we show how to
reduce the problem to the following natural geometric selection problem on line
segments. Consider a set of horizontal line segments in the plane. Given
parameters $i$ and $j$, a segment selection query returns the $j$th smallest
segment (the segment with the $j$th smallest $y$-coordinate) among the segments
crossing the vertical line through $x$-coordinate $i$. The segment selection
problem is to preprocess a set of horizontal line segments into a compact data
structure that supports fast segment selection queries. We present a solution
that uses $O(n)$ space and supports segment selection queries in
$O(\log n / \log \log n)$ time, where $n$ is the number of segments. Furthermore,
we prove that this query time is also optimal for any solution using
near-linear space.
Comment: Extended abstract at ISAAC 202
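The segment selection query itself can be stated as a brute-force reference implementation. This sketch scans and sorts on every query, so it shows only what a query must return, not the compact structure from the paper. The `(x1, x2, y)` tuple layout, the closed-interval crossing test, the 1-based rank, and the parameter names `i` and `j` are assumptions chosen for illustration.

```python
def segment_select(segments, i, j):
    """Return the segment with the j-th smallest y among those crossing x = i.

    segments: iterable of (x1, x2, y) horizontal segments with x1 <= x2.
    j is 1-based; returns None when fewer than j segments cross the line.
    """
    crossing = sorted(
        (y, x1, x2) for (x1, x2, y) in segments if x1 <= i <= x2
    )
    if not 1 <= j <= len(crossing):
        return None
    y, x1, x2 = crossing[j - 1]
    return (x1, x2, y)
```

A brute-force version like this is mainly useful as a correctness oracle when testing a faster structure.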
Update Query Time Trade-Off for Dynamic Suffix Arrays
The Suffix Array SA(S) of a string S[1 ... n] is an array containing all the
suffixes of S sorted by lexicographic order. The suffix array is one of the
most well known indexing data structures, and it functions as a key tool in
many string algorithms. In this paper, we present a data structure for
maintaining the Suffix Array of a dynamic string. For every
$0 \le \varepsilon \le 1$, our data structure reports SA[i] in
$\tilde{O}(n^{\varepsilon})$ time and handles text modification in
$\tilde{O}(n^{1-\varepsilon})$ time.
Additionally, our data structure enables the same query time for reporting
iSA[i], with iSA being the Inverse Suffix Array of S[1 ... n]. Our data
structure can be used to construct sub-linear dynamic variants of static
string algorithms or data structures that are based on the Suffix Array and
the Inverse Suffix Array.
Comment: 19 pages, 3 figure
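For readers unfamiliar with the two arrays, the following sketch computes them naively for a static string. It is not the dynamic data structure from the paper: it rebuilds everything from scratch, and it uses 0-based positions where the abstract uses S[1 ... n]. The function names are illustrative.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s in lexicographic order (0-based)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def inverse_suffix_array(sa):
    """isa[p] is the rank of the suffix starting at p, so isa[sa[k]] == k."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa
```

Recomputing both arrays after every edit costs at least linear time, which is the baseline that an update/query trade-off for dynamic strings aims to beat.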