14 research outputs found

Cross-Document Pattern Matching

Author: A. Andersson
J.L. Bentley
K. Sadakane
M. Farach
M.A. Bender
M.L. Fredman
O. Berkman
P. Bozanis
P. Dietz
R. Grossi
S. Muthukrishnan
T. Gagie
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Weighted ancestors in suffix trees

Author: D.E. Willard
M. Farach
M.A. Bender
O. Berkman
P. Bille
P. Gawrychowski
T. Kopelowitz
Publication date: 01/01/2014

The classical, ubiquitous predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in $O(n \operatorname{polylog}(n))$ space requires $\Omega(\log\log n)$ query time. Perhaps the most important and frequent application of the weighted ancestors problem is for suffix trees. It has been a long-standing open question whether the weighted ancestors problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text $w[1..n]$ can be preprocessed using $O(n)$ extra space, so that queries can be answered in $O(1)$ time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
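
As an illustration of the reduction mentioned in this abstract, the sketch below answers a weighted ancestor query by a predecessor-style binary search over the root-to-node path. It is a naive baseline, not the constant-time structure the paper describes: each query costs time proportional to the depth of the node. The array names `parent` and `weight` and the function `weighted_ancestor` are hypothetical, and the code assumes weights strictly increase from the root downwards (as string depths do in a suffix tree).

```python
from bisect import bisect_left

def weighted_ancestor(parent, weight, v, d):
    """Return the ancestor of v closest to the root (possibly v itself)
    whose weight is at least d, or None if no such ancestor exists.

    parent[u] is the parent of node u (-1 for the root) and weight[u] is
    its weight; weights are assumed to strictly increase along every
    root-to-node path.
    """
    # Collect the path from v up to the root, then reverse it so that
    # weights appear in increasing order.
    path = []
    u = v
    while u != -1:
        path.append(u)
        u = parent[u]
    path.reverse()
    depths = [weight[u] for u in path]
    # Predecessor-style search: first position whose weight is >= d.
    i = bisect_left(depths, d)
    return path[i] if i < len(path) else None

# Toy tree: node 0 is the root, node 1 its child, nodes 2 and 3 children of 1.
parent = [-1, 0, 1, 1]
weight = [0, 2, 5, 7]
answer = weighted_ancestor(parent, weight, 2, 3)
```

In the toy tree, the query for node 2 and target weight 3 lands on node 2 itself, because its parent has weight 2, which is below the target.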

arXiv.org e-Print Archive

CiteSeerX

Crossref

Computing Lempel-Ziv Factorization Online

Author: Starikovskaya Tatiana
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We present an algorithm which computes the Lempel-Ziv factorization of a word $W$ of length $n$ on an alphabet $\Sigma$ of size $\sigma$ online in the following sense: it reads $W$ starting from the left, and, after reading each $r = O(\log_{\sigma} n)$ characters of $W$, updates the Lempel-Ziv factorization. The algorithm requires $O(n \log \sigma)$ bits of space and $O(n \log^2 n)$ time. The basis of the algorithm is a sparse suffix tree combined with wavelet trees.
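
To make the object being computed concrete, here is a minimal sketch of a Lempel-Ziv factorization by brute force. It assumes one common definition (each factor is either a single fresh letter or the longest prefix of the remaining text that also starts at an earlier position, overlaps allowed); the abstract does not spell out which variant the paper uses. It does not reflect the paper's sparse suffix tree and wavelet tree machinery, and it runs in polynomial rather than near-linear time. The name `lz_factorize` is hypothetical.

```python
def lz_factorize(w):
    """Naive Lempel-Ziv factorization of the string w.

    Each factor is the longest prefix of w[i:] that also occurs starting
    at some earlier position j < i (the two occurrences may overlap), or
    a single character if no such non-empty prefix exists.
    """
    factors = []
    n = len(w)
    i = 0
    while i < n:
        longest = 0
        # Try every earlier starting position and extend the match.
        for j in range(i):
            length = 0
            while i + length < n and w[j + length] == w[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a fresh letter forms a factor of length 1
        factors.append(w[i:i + step])
        i += step
    return factors

factors = lz_factorize("abab")  # ['a', 'b', 'ab']
```

Concatenating the factors always gives back the input word, which is a convenient sanity check for any faster implementation.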

arXiv.org e-Print Archive

CiteSeerX

Crossref

Full-fledged Real-Time Indexing for Constant Size Alphabets

Author: Kucherov Gregory
Nekrich Yakov
Publication date: 06/07/2013

In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under the assumption of a constant-size alphabet, a long-standing open problem of the existence of a real-time indexing method for string matching (see \cite{AmirN08}).

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Internal Pattern Matching Queries in a Text and Applications

Author: Kociumaka Tomasz
Radoszewski Jakub
Rytter Wojciech
Waleń Tomasz
Publication date: 13/10/2014

We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding. Comment: 31 pages, 9 figures; accepted to SODA 201

arXiv.org e-Print Archive

Crossref

Fast Algorithm for Partial Covers in Words

Author: A. Apostolico
A.S. Fraenkel
D. Breslauer
D. Gusfield
D. Moore
E. Ukkonen
G.S. Brodal
J.S. Sim
M. Crochemore
M.R. Brown
Y. Li
Publication date: 01/01/2013

A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $\alpha$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $\alpha$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time, which we apply to compute all shortest $\alpha$-partial covers for a given $\alpha$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $\alpha$-partial cover for each $\alpha=1,2,\ldots,n$.
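
The definitions of cover and $\alpha$-partial cover in this abstract can be checked directly with a short brute-force sketch. It only illustrates the definitions and is unrelated to the $O(n)$-size data structure of the paper. The function names `covered_positions` and `is_partial_cover` are hypothetical.

```python
def covered_positions(w, u):
    """Count the positions of w that lie within some occurrence of u."""
    if not u:
        return 0
    covered = [False] * len(w)
    start = w.find(u)
    while start != -1:
        # Mark every position spanned by this occurrence.
        for p in range(start, start + len(u)):
            covered[p] = True
        # Advance by one so that overlapping occurrences are found too.
        start = w.find(u, start + 1)
    return sum(covered)

def is_partial_cover(w, u, alpha):
    """True if the factor u covers at least alpha positions of w."""
    return covered_positions(w, u) >= alpha

full = covered_positions("ababa", "aba")     # 5: "aba" is a cover of "ababa"
partial = covered_positions("abaab", "ab")   # 4: positions 0, 1, 3, 4
```

With $\alpha = |w|$ the partial cover condition coincides with the ordinary cover condition, so `is_partial_cover("ababa", "aba", 5)` holds while `is_partial_cover("abaab", "ab", 5)` does not.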

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

King's Research Portal

On Optimal Top-K String Retrieval

Author: Shah Rahul
Sheng Cheng
Thankachan Sharma V.
Vitter Jeffrey Scott
Publication date: 01/01/2012

Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in the external memory model have so far been elusive. The internal memory (or RAM) solution to this problem decomposes the problem into $O(p)$ subproblems and thus incurs the additive factor of $O(p)$. In external memory, these approaches will lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the problem independent of $p$, as interval stabbing with priority over a tree-shaped structure. This leads us to a linear space index in external memory supporting top-$k$ queries (with unsorted outputs) in near optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$ (where $\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$). Then we get an $O(n\log^* n)$ space index with optimal $O(p/B+\log_B n + k/B)$ I/Os. Comment: 3 figure

arXiv.org e-Print Archive

CiteSeerX

The Online House Numbering Problem: Min-Max Online List Labeling

Author: Devanny William E.
Fineman Jeremy T.
Goodrich Michael T.
Kopelowitz Tsvi
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 25th Annual European Symposium on Algorithms (ESA 2017)
Publication date: 01/01/2017

We introduce and study the online house numbering problem, where houses are added arbitrarily along a road and must be assigned labels to maintain their ordering along the road. The online house numbering problem is related to classic online list labeling problems, except that the optimization goal here is to minimize the maximum number of times that any house is relabeled. We provide several algorithms that achieve interesting tradeoffs between upper bounds on the maximum number of relabels per element and the number of bits used by labels.

Dagstuhl Research Online Publication Server

Full-Fledged Real-Time Indexing for Constant Size Alphabets

Author: DE Willard
Gregory Kucherov
ML Fredman
P van Emde Boas
R Cole
Yakov Nekrich
Z Galil
Publication venue: 'Springer Science and Business Media LLC'

Crossref