2,270 research outputs found
Finger Search in Grammar-Compressed Strings
Grammar-based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. Given a grammar, the
random access problem is to compactly represent the grammar while supporting
random access, that is, given a position in the original uncompressed string
report the character at that position. In this paper we study the random access
problem with the finger search property, that is, the time for a random access
query should depend on the distance between a specified index $f$, called the
\emph{finger}, and the query index $i$. We consider both a static variant,
where we first place a finger and subsequently access indices near the finger
efficiently, and a dynamic variant where also moving the finger such that the
time depends on the distance moved is supported.
Let $n$ be the size of the grammar, and let $N$ be the size of the string. For
the static variant we give a linear space representation that supports placing
the finger in $O(\log N)$ time and subsequently accessing in $O(\log D)$ time,
where $D$ is the distance between the finger and the accessed index. For the
dynamic variant we give a linear space representation that supports placing the
finger in $O(\log N)$ time and accessing and moving the finger in
$O(\log D + \log \log N)$ time. Compared to the best linear space solution to random
access, we improve an $O(\log N)$ query bound to $O(\log D)$ for the static
variant and to $O(\log D + \log \log N)$ for the dynamic variant, while
maintaining linear space. As an application of our results we obtain an
improved solution to the longest common extension problem in grammar compressed
strings. To obtain our results, we introduce several new techniques of
independent interest, including a novel van Emde Boas style decomposition of
grammars.
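To make the random access problem concrete, here is a minimal sketch of the naive baseline: store the expansion length of every nonterminal and walk down the derivation tree. The toy grammar `RULES` and the helper names are hypothetical, chosen only for illustration. This takes time proportional to the height of the derivation tree and is not the finger search structure described in the abstract.

```python
from functools import lru_cache

# Hypothetical toy grammar: each nonterminal maps to a pair of symbols.
# Any symbol that is not a key of RULES is a terminal character.
RULES = {
    "S": ("A", "B"),
    "A": ("C", "C"),
    "B": ("C", "a"),
    "C": ("a", "b"),
}


@lru_cache(maxsize=None)
def length(sym):
    """Length of the string generated by sym."""
    if sym not in RULES:
        return 1
    left, right = RULES[sym]
    return length(left) + length(right)


def access(sym, i):
    """Return character i (0-based) of the expansion of sym, without expanding it."""
    if not 0 <= i < length(sym):
        raise IndexError(i)
    while sym in RULES:
        left, right = RULES[sym]
        if i < length(left):
            sym = left              # position lies in the left child
        else:
            i -= length(left)       # skip the left child and go right
            sym = right
    return sym


def expand(sym):
    """Full decompression, used here only to check access()."""
    if sym not in RULES:
        return sym
    return "".join(expand(s) for s in RULES[sym])


print(expand("S"))      # abababa
print(access("S", 4))   # a
```

A finger search structure aims to avoid restarting this walk from the root when the next query lies close to a previously placed finger.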
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a
string attractor for a text of length $n$. Our index takes
$O(\gamma \log(n/\gamma))$ words of space and supports locating the $occ$
occurrences of any pattern of length $m$ in $O(m \log n + occ \log^{\epsilon} n)$
time, for any constant $\epsilon > 0$. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
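The notion of a string attractor can be illustrated with a brute-force checker. The sketch below assumes the usual reading of "capturing all distinct substrings": every distinct substring must have at least one occurrence that covers a position in the set. The function name `is_attractor` is hypothetical, and the cubic-time loop is only suitable for tiny examples.

```python
def is_attractor(text, positions):
    """Check whether `positions` (0-based) is a string attractor of `text`.

    Assumed definition: every distinct substring has at least one
    occurrence text[start:start+length] containing some chosen position.
    """
    n = len(text)
    gamma = sorted(set(positions))
    for length in range(1, n + 1):
        covered = {}
        for start in range(n - length + 1):
            sub = text[start:start + length]
            hit = any(start <= p < start + length for p in gamma)
            covered[sub] = covered.get(sub, False) or hit
        if not all(covered.values()):
            return False
    return True


print(is_attractor("abab", {0, 1}))  # True
print(is_attractor("abab", {0}))     # False: no occurrence of "b" covers position 0
```

Highly repetitive texts admit small attractors, which is why an index whose size depends on the attractor size can sit on top of any dictionary compressor.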
Linear Compressed Pattern Matching for Polynomial Rewriting (Extended Abstract)
This paper is an extended abstract of an analysis of term rewriting where the
terms in the rewrite rules as well as the term to be rewritten are compressed
by a singleton tree grammar (STG). This form of compression is more general
than node sharing or representing terms as dags since also partial trees
(contexts) can be shared in the compression. In the first part efficient but
complex algorithms for detecting applicability of a rewrite rule under
STG-compression are constructed and analyzed. The second part applies these
results to term rewriting sequences.
The main result for submatching is that finding a redex of a left-linear rule
can be performed in polynomial time under STG-compression.
The main implications for rewriting and (single-position or parallel)
rewriting steps are: (i) under STG-compression, n rewriting steps can be
performed in nondeterministic polynomial time; (ii) under STG-compression and
for left-linear rewrite rules, a sequence of n rewriting steps can be performed
in polynomial time; and (iii) for compressed rewrite rules where the left-hand
sides are either DAG-compressed or ground and STG-compressed, and an
STG-compressed target term, n rewriting steps can be performed in polynomial
time.
Comment: In Proceedings TERMGRAPH 2013, arXiv:1302.599
Fingerprints in Compressed Strings
The Karp-Rabin fingerprint of a string is a type of hash value that, due to its strong properties, has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i,j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(log N) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log log N) query time. Hence, our data structures have the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log l) and O(log l log log l + log log N) for SLPs and Linear SLPs, respectively. Here, l denotes the length of the LCE.
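As a point of reference, the following sketch computes Karp-Rabin fingerprints of substrings on an uncompressed string from prefix fingerprints. The modulus `P`, the fixed base `B`, and the function names are assumptions made for illustration (in practice the base is drawn at random), and the substring is given as a half-open range s[i:j] rather than the inclusive S[i,j] used in the abstract. It does not reproduce the compressed data structures of the paper.

```python
P = (1 << 61) - 1   # a Mersenne prime used as modulus (assumed choice)
B = 256             # base; fixed here for reproducibility, normally random


def prefix_fingerprints(s):
    """Return (fp, pw) with fp[k] = fingerprint of s[:k] and pw[k] = B**k mod P."""
    fp, pw = [0], [1]
    for ch in s:
        fp.append((fp[-1] * B + ord(ch)) % P)
        pw.append((pw[-1] * B) % P)
    return fp, pw


def fingerprint(fp, pw, i, j):
    """Fingerprint of s[i:j], obtained from two prefix fingerprints."""
    return (fp[j] - fp[i] * pw[j - i]) % P


def concat(f_left, f_right, len_right, pw):
    """Fingerprint of xy from the fingerprints of x and y and the length of y."""
    return (f_left * pw[len_right] + f_right) % P


s = "abracadabra"
fp, pw = prefix_fingerprints(s)
print(fingerprint(fp, pw, 0, 4) == fingerprint(fp, pw, 7, 11))  # True: both are "abra"
```

The composition rule in `concat` is the kind of property that lets fingerprints of grammar symbols be combined instead of recomputed from decompressed characters.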
Compression by Contracting Straight-Line Programs
In grammar-based compression a string is represented by a context-free
grammar, also called a straight-line program (SLP), that generates only that
string. We refine a recent balancing result stating that one can transform an
SLP of size $g$ in linear time into an equivalent SLP of size $O(g)$ so that
the height of the unique derivation tree is $O(\log N)$ where $N$ is the length
of the represented string (FOCS 2019). We introduce a new class of balanced
SLPs, called contracting SLPs, where for every rule $A \to \beta_1 \dots \beta_k$
the string length of every variable $\beta_i$ on the right-hand side
is smaller by a constant factor than the string length of $A$. In particular,
the derivation tree of a contracting SLP has the property that every subtree
has logarithmic height in its leaf size. We show that a given SLP of size $g$
can be transformed in linear time into an equivalent contracting SLP of size
$O(g)$ with rules of constant length.
We present an application to the navigation problem in compressed unranked
trees, represented by forest straight-line programs (FSLPs). We extend a linear
space data structure by Reh and Sieber (2020) by the operation of moving to the
$i$-th child in time $O(\log d)$ where $d$ is the degree of the current node.
Contracting SLPs are also applied to the finger search problem over
SLP-compressed strings where one wants to access positions near to a
pre-specified finger position, ideally in $O(\log d)$ time where $d$ is the
distance between the accessed position and the finger. We give a linear space
solution where one can access symbols or move the finger in time
$O(\log d + \log^{(t)} N)$ for any constant $t$ where $\log^{(t)} N$ is the $t$-fold
logarithm of $N$. This improves a previous solution by Bille, Christiansen,
Cording, and Gørtz (2018) with access/move time $O(\log d + \log \log N)$.
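The contracting condition can be checked directly on a small grammar. The sketch below assumes the constant factor is 1/2 (exposed as the hypothetical parameter `factor`) and represents an SLP as a dictionary from variables to right-hand sides; both choices are illustrative and not taken from the paper.

```python
from functools import lru_cache


def is_contracting(rules, factor=0.5):
    """Check that every variable on a right-hand side generates a string
    at most `factor` times as long as the rule's left-hand side.

    `rules` maps each variable to a tuple of symbols; symbols that are
    not keys of `rules` are terminals and are not constrained.
    """
    @lru_cache(maxsize=None)
    def length(sym):
        if sym not in rules:
            return 1
        return sum(length(s) for s in rules[sym])

    return all(
        length(child) <= factor * length(head)
        for head, rhs in rules.items()
        for child in rhs
        if child in rules
    )


balanced = {"S": ("A", "A"), "A": ("B", "B"), "B": ("a", "b")}
chain = {"S": ("A", "a"), "A": ("B", "a"), "B": ("a", "b")}
print(is_contracting(balanced))  # True
print(is_contracting(chain))     # False: A has length 3, S has length 4
```

Because lengths shrink geometrically along every root-to-leaf path, each subtree of such a grammar has height logarithmic in its own leaf count, which is the property the abstract highlights.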
Fully dynamic data structure for LCE queries in compressed space
A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for
the length of the longest common prefix of suffixes starting at two given
positions. We show that the signature encoding $\mathcal{G}$ of size
$w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198,
1997] of $T$, which can be seen as a compressed representation of $T$, has a
capability to support LCE queries in $O(\log N + \log \ell \log^* M)$ time,
where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77
(LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled
in constant time under the word RAM model. In compressed space, this is the fastest
deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be
enhanced to support efficient update operations: After processing $\mathcal{G}$
in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$
into/from an arbitrary position of $T$ in $O((y + \log N \log^* M) f_{\mathcal{A}})$
time, where $f_{\mathcal{A}} = O(\min\{\frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}}\})$. This yields
the first fully dynamic LCE data structure. We also present efficient
construction algorithms from various types of inputs: We can construct
$\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in
$O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$
represented by a straight-line program of size $n$; and in
$O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top
of the above contributions, we show several applications of our data structures
which improve previous best known results on grammar-compressed string
processing.
Comment: arXiv admin note: text overlap with arXiv:1504.0695
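For readers unfamiliar with LCE queries, the sketch below shows the definition as a naive scan, plus a common fingerprint-based alternative on an uncompressed string that uses exponential search followed by binary search. This is a swapped-in illustrative technique, not the deterministic signature encoding of the abstract: equal fingerprints only imply equal substrings with high probability, and the base, modulus, and function names are assumptions.

```python
def lce_naive(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] by direct scan."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def make_equal(s, base=256, mod=(1 << 61) - 1):
    """Precompute prefix fingerprints; return a substring equality test."""
    fp, pw = [0], [1]
    for ch in s:
        fp.append((fp[-1] * base + ord(ch)) % mod)
        pw.append((pw[-1] * base) % mod)

    def equal(i, j, k):
        """True if s[i:i+k] and s[j:j+k] have equal fingerprints."""
        a = (fp[i + k] - fp[i] * pw[k]) % mod
        b = (fp[j + k] - fp[j] * pw[k]) % mod
        return a == b

    return equal


def lce(s, i, j):
    """LCE using O(log l) fingerprint comparisons, l being the answer."""
    equal = make_equal(s)            # rebuilt per call only to keep the sketch short
    limit = len(s) - max(i, j)       # longest possible common prefix
    k = 1
    while k <= limit and equal(i, j, k):
        k *= 2                       # exponential search for an upper bound
    lo, hi = k // 2, min(k, limit)   # answer lies in [lo, hi]
    while lo < hi:                   # binary search for the largest matching length
        mid = (lo + hi + 1) // 2
        if equal(i, j, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print(lce("abracadabra", 0, 7))  # 4
```

In real use the prefix fingerprints would be built once and shared across queries; the compressed structures in the abstract replace them with a representation whose size depends on the LZ77 factorization.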