Cross-Document Pattern Matching
We study a new variant of the string matching problem called cross-document
string matching, which is the problem of indexing a collection of documents to
support an efficient search for a pattern in a selected document, where the
pattern itself is a substring of another document. Several variants of this
problem are considered, and efficient linear-space solutions are proposed with
query time bounds that either do not depend at all on the pattern size or
depend on it in a very limited way (doubly logarithmic). As a side result, we
propose an improved solution to the weighted level ancestor problem.
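To make the query format concrete, here is a minimal brute-force sketch of a cross-document query. It is only a baseline for illustration and not the indexed solution from the abstract: the function name and the `(source, start, end, target)` argument layout are assumptions, and each query scans the whole target document.

```python
def cross_document_occurrences(documents, source, start, end, target):
    """Start positions in documents[target] of the pattern documents[source][start:end].

    Hypothetical helper: the pattern is given by reference (document index plus
    half-open range) rather than as an explicit string.
    """
    pattern = documents[source][start:end]
    if not pattern:
        raise ValueError("the pattern must be non-empty")
    text = documents[target]
    positions = []
    pos = text.find(pattern)
    while pos != -1:
        positions.append(pos)
        pos = text.find(pattern, pos + 1)  # step by one so overlapping matches count
    return positions


docs = ["abracadabra", "cadabra cad"]
print(cross_document_occurrences(docs, 0, 4, 7, 1))  # pattern "cad" -> [0, 8]
```

The point of the indexed solutions is to avoid exactly this scan, since the pattern is already known to the index as a substring of another document.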
Weighted ancestors in suffix trees
The classical, ubiquitous, predecessor problem is to construct a data
structure for a set of integers that supports fast predecessor queries. Its
generalization to weighted trees, a.k.a. the weighted ancestor problem, has
been extensively explored and successfully reduced to the predecessor problem.
It is known that any solution for both problems with an input set from a
polynomially bounded universe that preprocesses a weighted tree in O(n
polylog(n)) space requires \Omega(\log\log n) query time. Perhaps the most
important and frequent application of the weighted ancestors problem is for
suffix trees. It has been a long-standing open question whether the weighted
ancestors problem has better bounds for suffix trees. We answer this question
positively: we show that a suffix tree built for a text w[1..n] can be
preprocessed using O(n) extra space, so that queries can be answered in O(1)
time. Thus we improve the running times of several applications. Our
improvement is based on a number of data structure tools and a
periodicity-based insight into the combinatorial structure of a suffix tree.
Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
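For readers unfamiliar with the query itself, the following sketch answers weighted ancestor queries with binary lifting in O(log n) time per query. It is a textbook baseline, not the constant-time suffix-tree structure described in the abstract. The array layout (`parent`, `weight`) and the function names are assumptions, and weights are assumed to increase strictly from parent to child, as string depths do in a suffix tree.

```python
def build_jump_table(parent, root):
    """up[j][v] is the 2**j-th ancestor of v; jumps past the root stay at the root."""
    n = len(parent)
    up = [[root if v == root else parent[v] for v in range(n)]]
    while (1 << len(up)) < n:
        prev = up[-1]
        up.append([prev[prev[v]] for v in range(n)])
    return up


def weighted_ancestor(up, weight, v, d):
    """Shallowest ancestor u of v (possibly v itself) with weight[u] >= d."""
    if weight[v] < d:
        raise ValueError("no ancestor of v has weight >= d")
    for level in reversed(up):   # try the longest jumps first
        u = level[v]
        if weight[u] >= d:       # still heavy enough, so keep climbing
            v = u
    return v


# Toy tree: 0 is the root; edges 0-1, 1-2, 1-3, 2-4.
parent = [0, 0, 1, 1, 2]
weight = [0, 2, 5, 3, 6]
up = build_jump_table(parent, root=0)
print(weighted_ancestor(up, weight, 4, 3))  # -> 2
```

In the suffix-tree setting, a query with node v and weight d locates the node (or edge) spelling the length-d prefix of the string represented by v.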
Computing Lempel-Ziv Factorization Online
We present an algorithm which computes the Lempel-Ziv factorization of a word
of length n over a finite alphabet online in the
following sense: it reads the word starting from the left and, after reading
each block of characters, updates the Lempel-Ziv
factorization. The algorithm requires bits of space and O(n
\log^2 n) time. The basis of the algorithm is a sparse suffix tree combined
with wavelet trees
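As a reference point for what is being computed, here is a naive offline sketch of the Lempel-Ziv factorization. It assumes the common variant in which each factor is either a fresh letter or the longest prefix of the remaining suffix that also starts at an earlier position (overlaps allowed). It runs in quadratic time or worse and uses neither sparse suffix trees nor wavelet trees, so it is not the algorithm of the abstract.

```python
def lz_factorize(word):
    """Naive Lempel-Ziv factorization, returned as a list of factors."""
    factors = []
    n = len(word)
    i = 0
    while i < n:
        best = 0
        for j in range(i):  # every earlier starting position is a candidate source
            length = 0
            while i + length < n and word[j + length] == word[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a letter with no earlier occurrence forms its own factor
        factors.append(word[i:i + step])
        i += step
    return factors


print(lz_factorize("abababb"))  # -> ['a', 'b', 'abab', 'b']
```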
Full-fledged Real-Time Indexing for Constant Size Alphabets
In this paper we describe a data structure that supports pattern matching
queries on a dynamically arriving text over an alphabet of constant size. Each
new symbol can be prepended to the text in O(1) worst-case time. At any moment, we
can report all occurrences of a pattern in the current text in time bounded in
terms of the length of the pattern and the number of occurrences.
This resolves, under the assumption of a constant-size alphabet, the long-standing open
problem of the existence of a real-time indexing method for string matching (see
\cite{AmirN08}).
Internal Pattern Matching Queries in a Text and Applications
We consider several types of internal queries: questions about subwords of a
text. As the main tool we develop an optimal data structure for the problem
called here internal pattern matching. This data structure provides
constant-time answers to queries about occurrences of one subword in
another subword of a given text, under an assumption on the two subwords
that allows for a constant-space representation of all occurrences. This
problem can be viewed as a natural extension of the well-studied pattern
matching problem. The data structure has linear size and admits a linear-time
construction algorithm.
Using the solution to the internal pattern matching problem, we obtain very
efficient data structures answering queries about: primitivity of subwords,
periods of subwords, general substring compression, and cyclic equivalence of
two subwords. All these results improve upon the best previously known
counterparts. The linear construction time of our data structure also allows us to
improve the algorithm for finding subrepetitions in a text (a more
general version of maximal repetitions, also called runs). For any fixed value
of the subrepetition parameter we obtain the first linear-time algorithm, which matches the linear
time complexity of the algorithm computing runs. Our data structure has already
been used as a part of the efficient solutions for subword suffix rank &
selection, as well as substring compression using Burrows-Wheeler transform
composed with run-length encoding.
Comment: 31 pages, 9 figures; accepted to SODA 201
Fast Algorithm for Partial Covers in Words
A factor of a word is a cover of that word if every position in the word lies
within some occurrence of the factor. A word covered by one of its factors thus
generalizes the idea of a repetition, that is, a word composed of exact
concatenations of a single factor. In this article we introduce a new notion of
partial cover, which can be viewed as a relaxed variant of cover, that
is, a factor covering at least a prescribed number of positions in the word. We develop a data
structure, with guarantees on its size and construction time, which we apply to compute all shortest partial covers for a
given number of positions. We also employ it in an algorithm computing
a shortest partial cover for each such number.
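The definitions above can be checked directly with a brute-force sketch. The helper names and the parameter name `alpha` (the required number of covered positions) are assumptions made for illustration, and the running time is far from the efficient bounds the abstract refers to.

```python
def covered_positions(word, factor):
    """Number of positions of word lying within some occurrence of factor."""
    covered = [False] * len(word)
    start = word.find(factor)
    while start != -1:
        for p in range(start, start + len(factor)):
            covered[p] = True
        start = word.find(factor, start + 1)  # overlapping occurrences count too
    return sum(covered)


def is_cover(word, factor):
    """A cover is a factor whose occurrences touch every position of word."""
    return covered_positions(word, factor) == len(word)


def shortest_partial_covers(word, alpha):
    """All shortest factors of word covering at least alpha positions (brute force)."""
    n = len(word)
    for m in range(1, n + 1):  # try factor lengths in increasing order
        candidates = {word[i:i + m] for i in range(n - m + 1)}
        found = sorted(f for f in candidates if covered_positions(word, f) >= alpha)
        if found:
            return found
    return []


print(is_cover("aabaabaa", "aabaa"))         # -> True
print(shortest_partial_covers("ababab", 4))  # -> ['ab', 'ba']
```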
On Optimal Top-K String Retrieval
Consider a given set of (string) documents, whose total length is the input
size. The top-k document retrieval problem
is to index this set such that when a pattern and a
parameter k come as a query, the index returns the k most relevant
documents to the pattern. Hon et al. \cite{HSV09} gave the first linear-space
framework to solve this problem, and Navarro and Nekrich \cite{NN12} later
improved its query time. These results are
powerful enough to support arbitrary relevance functions like frequency,
proximity, PageRank, etc. In many applications like desktop or email search,
the data resides on disk and hence disk-bound indexes are needed. Despite
continued progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in the external memory model have so
far been elusive. Internal memory (or RAM) solutions to this problem decompose
it into subproblems and thus incur an additive term in the query cost.
In external memory, these approaches lead to more I/Os than the optimal
I/O term, which depends on the block size. We re-interpret the
problem, independently of the pattern length, as interval stabbing with priority over a tree-shaped
structure. This leads us to a linear-space index in external memory supporting
top-k queries (with unsorted outputs) in near-optimal I/Os, where the bound
involves an arbitrary constant parameter. We then obtain an index with optimal
I/Os under a different space bound.
Comment: 3 figure
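A small in-memory sketch may help fix the problem statement. It uses term frequency as the relevance function (one of the examples named in the abstract) and scans every document per query. The function names are assumptions, and nothing here reflects the external-memory index itself.

```python
import heapq


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Up to k (document index, frequency) pairs, most frequent first.

    Ties are broken in favour of the smaller document index; documents that do
    not contain the pattern are never reported.
    """
    scored = ((count_occurrences(doc, pattern), -i) for i, doc in enumerate(documents))
    best = heapq.nlargest(k, (s for s in scored if s[0] > 0))
    return [(-neg_index, freq) for freq, neg_index in best]


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_documents(docs, "an", 2))  # -> [(0, 2), (1, 2)]
```

This baseline does work proportional to the total length of all documents on every query, which is the cost an index is meant to avoid.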
The Online House Numbering Problem: Min-Max Online List Labeling
We introduce and study the online house numbering problem, where houses are added arbitrarily along a road and must be assigned labels to maintain their ordering along the road. The online house numbering problem is related to classic online list labeling problems, except that the optimization goal here is to minimize the maximum number of times that any house is relabeled. We provide several algorithms that achieve interesting tradeoffs between upper bounds on the maximum number of relabels per element and the number of bits used by labels.
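To illustrate the cost measure, here is a naive baseline sketch: a new house takes the midpoint label between its neighbours when one is free, and otherwise all houses are respaced evenly. The class and method names are assumptions, and this strategy is not one of the algorithms from the abstract. It only shows how the maximum number of relabels per house and the number of label bits interact.

```python
class Road:
    """Houses in road order, each with an integer label in [0, 2**label_bits)."""

    def __init__(self, label_bits):
        self.limit = 1 << label_bits
        self.names = []     # house names in road order
        self.label = {}     # name -> current label
        self.relabels = {}  # name -> number of times its label has changed

    def insert(self, index, name):
        """Add a new house so that it ends up at position index along the road."""
        if len(self.names) + 1 > self.limit:
            raise OverflowError("not enough distinct labels")
        lo = self.label[self.names[index - 1]] if index > 0 else -1
        hi = self.label[self.names[index]] if index < len(self.names) else self.limit
        self.names.insert(index, name)
        self.relabels[name] = 0
        if hi - lo >= 2:
            self.label[name] = (lo + hi) // 2  # a free label strictly between neighbours
        else:
            self._respace()

    def _respace(self):
        """Spread all labels evenly; count a relabel for each house whose label changes."""
        count = len(self.names)
        for i, name in enumerate(self.names):
            new = i * self.limit // count
            if name in self.label and self.label[name] != new:
                self.relabels[name] += 1
            self.label[name] = new

    def max_relabels(self):
        return max(self.relabels.values(), default=0)


road = Road(label_bits=3)
for position, name in enumerate("abcde"):
    road.insert(position, name)  # always append at the far end of the road
print([road.label[name] for name in road.names], road.max_relabels())
# -> [0, 1, 3, 4, 6] 1
```

With more label bits the respacing step is triggered less often, which is the kind of tradeoff between relabels and label length that the abstract describes.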