Full-fledged Real-Time Indexing for Constant Size Alphabets
In this paper we describe a data structure that supports pattern matching
queries on a dynamically arriving text over an alphabet of constant size. Each
new symbol can be prepended to T in O(1) worst-case time. At any moment, we
can report all occurrences of a pattern P in the current text in O(|P|+k)
time, where |P| is the length of P and k is the number of occurrences.
This resolves, under the assumption of a constant-size alphabet, a long-standing open
problem of the existence of a real-time indexing method for string matching (see
Amir and Nor, Real-time indexing over fixed finite alphabets, 2008)
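To make the query interface concrete, here is a minimal sketch of a naive baseline that supports the same two operations: prepending a symbol and reporting all occurrences of a pattern. It is not the data structure from the paper and gives none of its time bounds (each query rescans the whole text); the class and method names are hypothetical and chosen only for illustration.

```python
class NaivePrependIndex:
    """Naive baseline for the interface only; NOT the paper's data structure."""

    def __init__(self):
        # The text is stored back to front, so a prepend is a list append.
        self._reversed = []

    def prepend(self, symbol):
        """Add one symbol to the front of the current text."""
        self._reversed.append(symbol)

    def text(self):
        """Return the current text, front to back."""
        return "".join(reversed(self._reversed))

    def occurrences(self, pattern):
        """Return all start positions of pattern in the current text."""
        text = self.text()
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(start)
            # Step by one so that overlapping occurrences are reported too.
            start = text.find(pattern, start + 1)
        return hits
```

Prepending the symbols a, n, a, n, a, b in that order yields the text "banana", and a query for "ana" reports both overlapping occurrences.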
Sufficient Conditions for Efficient Indexing Under Different Matchings
The most important task arising from the massive accumulation of digital data is efficient access to this data, hence the importance of indexing. In the last decade, many different types of matching relations were defined, each requiring an efficient indexing scheme. Cole and Hariharan, in a groundbreaking paper [Cole and Hariharan, SIAM J. Comput., 33(1):26-42, 2003], formulated sufficient conditions for building an efficient index for quasi-suffix collections, that is, collections that behave like suffixes. It was shown that known matchings, including parameterized, 2-D array and order-preserving matchings, fit their indexing settings. In this paper, we formulate more basic sufficient conditions based on the order relation derived from the matching relation itself. Our conditions are more general than the previously known conditions
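As an illustration of one of the non-standard matching relations named above, the following sketch checks order-preserving matching by brute force: a window of the text matches the pattern when its elements have the same relative order. This shows only the matching relation, not an index and not the conditions formulated in the paper; the function names are hypothetical.

```python
def order_pattern(seq):
    """Replace each element by its rank among the distinct values of seq."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(seq)))}
    return [ranks[value] for value in seq]


def order_preserving_occurrences(text, pattern):
    """Return start positions where a text window has the pattern's relative order."""
    m = len(pattern)
    target = order_pattern(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if order_pattern(text[i:i + m]) == target
    ]
```

For example, the pattern [1, 3, 2] (low, high, middle) matches the windows [10, 20, 15] and [30, 40, 35] of the text [10, 20, 15, 30, 40, 35], although none of the values are equal.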
Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing
This paper presents a general technique for optimally transforming any
dynamic data structure that operates on atomic and indivisible keys by
constant-time comparisons, into a data structure that handles unbounded-length
keys whose comparison cost is not a constant. Examples of these keys are
strings, multi-dimensional points, multiple-precision numbers, multi-key data
(e.g. records), XML paths, URL addresses, etc. The technique is more general
than what has been done in previous work as no particular exploitation of the
underlying structure of is required. The only requirement is that the insertion
of a key must identify its predecessor or its successor.
Using the proposed technique, online suffix tree can be constructed in worst
case time per input symbol (as opposed to amortized
time per symbol, achieved by previously known algorithms). To our knowledge,
our algorithm is the first that achieves worst case time per input
symbol. Searching for a pattern of length in the resulting suffix tree
takes time, where is the
number of occurrences of the pattern. The paper also describes more
applications and shows how to obtain alternative methods for dealing with suffix
sorting, dynamic lowest common ancestors and order maintenance
Sliding Window String Indexing in Streams
Given a string S over an alphabet Σ, the 'string indexing problem'
is to preprocess S to subsequently support efficient pattern matching
queries, i.e., given a pattern string P report all the occurrences of P in
S. In this paper we study the 'streaming sliding window string indexing
problem'. Here the string S arrives as a stream, one character at a time, and
the goal is to maintain an index of the last w characters, called the
'window', for a specified parameter w. At any point in time a pattern
matching query for a pattern P may arrive, also streamed one character at a
time, and all occurrences of P within the current window must be returned.
The streaming sliding window string indexing problem naturally captures
scenarios where we want to index the most recent data (i.e. the window) of a
stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w)
time with high probability to process each character from both the input string
S and the pattern string P. Reporting each occurrence of P uses
additional constant time per reported occurrence. Compared to previous work in
similar scenarios this result is the first to achieve an efficient worst-case
time per character from the input stream. We also consider a delayed variant of
the problem, where a query may be answered at any point within the next δ
characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time
bounds to O(log(w/δ)). In particular, for a delay of δ = εw we obtain an O(w) space data structure with constant time processing per
character. The key idea to achieve our result is a novel and simple
hierarchical structure of suffix trees of independent interest, inspired by the
classic log-structured merge trees
Sliding Window String Indexing in Streams
Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = εw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees
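The problem statement can be pinned down with a naive baseline: keep only the last w characters of the stream and answer each query by rescanning the window. This sketch is not the hierarchical suffix-tree structure from the paper and has none of its time bounds; the class name, the method names, and the choice to report occurrences as 0-based positions in the whole stream are assumptions made for illustration.

```python
from collections import deque


class NaiveSlidingWindowIndex:
    """Naive baseline for the problem definition; NOT the paper's data structure."""

    def __init__(self, w):
        self.window = deque(maxlen=w)  # keeps only the last w characters
        self.seen = 0                  # total number of characters streamed so far

    def push(self, ch):
        """Process one character of the input stream."""
        self.window.append(ch)
        self.seen += 1

    def occurrences(self, pattern):
        """Return stream positions of all occurrences of pattern inside the window."""
        text = "".join(self.window)
        offset = self.seen - len(text)  # stream position of the window's first character
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(offset + start)
            start = text.find(pattern, start + 1)
        return hits
```

With w = 5 and the stream "abcabcab", the window holds "abcab" (stream positions 3 to 7), so the occurrence of "ab" at position 0 has already left the window and only positions 3 and 6 are reported.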
Locally Consistent Parsing for Text Indexing in Small Space
We consider two closely related problems of text indexing in a sub-linear
working space. The first problem is the Sparse Suffix Tree (SST) construction
of a set of suffixes using only words of space. The second problem
is the Longest Common Extension (LCE) problem, where for some parameter
, the goal is to construct a data structure that uses words of space and can compute the longest common prefix length of
any pair of suffixes. We show how to use ideas based on the Locally Consistent
Parsing technique, that was introduced by Sahinalp and Vishkin [STOC '94], in
some non-trivial ways in order to improve the known results for the above
problems. We introduce new Las-Vegas and deterministic algorithms for both
problems.
We introduce the first Las-Vegas SST construction algorithm that takes
time. This is an improvement over the last result of Gawrychowski and Kociumaka
[SODA '17] who obtained time for Monte-Carlo algorithm, and
time for Las-Vegas algorithm. In addition, we introduce a
randomized Las-Vegas construction for an LCE data structure that can be
constructed in linear time and answers queries in time.
For the deterministic algorithms, we introduce an SST construction algorithm
that takes time (for ). This is
the first almost linear time, , deterministic SST
construction algorithm, where all previous algorithms take at least
time. For the LCE problem, we
introduce a data structure that answers LCE queries in
time, with construction time (for ).
This data structure improves both query time and construction time upon the
results of Tanimura et al. [CPM '16]. Comment: Extended abstract to appear in SODA 202
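For reference, the LCE query itself can be stated as a few lines of code. The sketch below answers a query by direct character comparison, using no auxiliary space beyond the text and time proportional to the answer; it only defines what the query returns and is unrelated to the locally consistent parsing technique used in the paper. The function name is hypothetical.

```python
def lce(text, i, j):
    """Length of the longest common prefix of the suffixes text[i:] and text[j:]."""
    n = len(text)
    length = 0
    while (
        i + length < n
        and j + length < n
        and text[i + length] == text[j + length]
    ):
        length += 1
    return length
```

In "banana", the suffixes starting at positions 1 and 3 are "anana" and "ana", so lce("banana", 1, 3) is 3.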
Full-Fledged Real-Time Indexing for Constant Size Alphabets
In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to T in O(1) worst-case time. At any moment, we can report all occurrences of a pattern P in the current text in O(|P|+k) time, where |P| is the length of P and k is the number of occurrences. This resolves, under assumption of constant size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see Amir and Nor in Real-time indexing over fixed finite alphabets, pp. 1086–1095, 2008)