Weighted ancestors in suffix trees

Author: D.E. Willard
M. Farach
M.A. Bender
O. Berkman
P. Bille
P. Gawrychowski
T. Kopelowitz
Publication date: 01/01/2014

The classical, ubiquitous predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires \Omega(\log\log n) query time. Perhaps the most important and frequent application of the weighted ancestor problem is for suffix trees. It has been a long-standing open question whether the weighted ancestor problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
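To make the two problems named in this abstract concrete, here is a minimal Python sketch. It is not the O(1)-time structure from the paper: `predecessor` is a plain binary search, and the hypothetical `WeightedAncestor` class uses ordinary binary lifting, giving O(log n) query time after O(n log n) preprocessing. The sketch assumes that weights strictly increase from parent to child (as string depths do in a suffix tree), that the root is its own parent, and that a node counts as its own ancestor.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x in a sorted list, or None if there is none."""
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


class WeightedAncestor:
    """Binary-lifting baseline (assumption: not the method of the paper)."""

    def __init__(self, parent, weight):
        # parent[root] == root; weight strictly increases from parent to child.
        n = len(parent)
        self.weight = weight
        self.up = [list(parent)]  # up[j][v] = ancestor of v that is 2**j steps up
        levels = max(1, (n - 1).bit_length())
        for _ in range(1, levels):
            prev = self.up[-1]
            self.up.append([prev[prev[v]] for v in range(n)])

    def query(self, u, k):
        """Shallowest ancestor of u (possibly u itself) with weight >= k."""
        if self.weight[u] < k:
            return None
        # Jump upward greedily while the target still has weight >= k.
        for level in reversed(self.up):
            v = level[u]
            if self.weight[v] >= k:
                u = v
        return u
```

On a root-to-node path the weights form a sorted sequence, so `query` is a predecessor-style search restricted to that path, which is the sense in which the tree problem generalizes the integer one.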

arXiv.org e-Print Archive

CiteSeerX

Crossref

Weighted ancestors in suffix trees revisited

Author: Belazzougui Djamal
Kosolobov Dmitry
Puglisi Simon J.
Raman Rajeev
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2021

The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided O(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an O(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in O(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values. Peer reviewed.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Leicester Research Archive

Size-constrained Weighted Ancestors with Applications

Author: Bille Philip
Nekrich Yakov
Pissis Solon P.
Publication date: 27/11/2023

The weighted ancestor problem on a rooted node-weighted tree $T$ is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require $\Omega(\log\log n)$ time for queries provided $\mathcal{O}(n\,\text{polylog}\,n)$ space is available, where $n$ is the input size. The weighted ancestor problem has attracted a lot of attention from the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with $\mathcal{O}(1)$ query time and an $\mathcal{O}(n)$-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function $\textsf{weight}$ that maps the nodes of $T$ to positive integers, such that $\textsf{weight}(u)\le \textsf{size}(u)$ for any node $u$ and $\textsf{weight}(u_1)\le \textsf{weight}(u_2)$ if node $u_1$ is a descendant of node $u_2$, where $\textsf{size}(u)$ is the number of nodes in the subtree rooted at $u$. In the size-constrained weighted ancestor (SWAQ) problem, for any node $u$ of $T$ and any integer $k$, we are asked to return the lowest ancestor $w$ of $u$ with weight at least $k$. We show that for any rooted tree with $n$ nodes, we can locate node $w$ in $\mathcal{O}(1)$ time after $\mathcal{O}(n)$-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with $\mathcal{O}(1)$ query time and $\mathcal{O}(n)$-time preprocessing, when the nodes are weighted by $\textsf{weight}$. We also show several string-processing applications of this result.
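The SWAQ query defined in this abstract can be illustrated with a naive baseline. The sketch below is an assumption-laden illustration, not the paper's constant-time structure: the hypothetical `swaq_naive` simply walks parent pointers, so a query costs time proportional to the depth of the node. It uses `weight = size` as one valid weighting (it satisfies both constraints), assumes the root is its own parent, and treats a node as its own ancestor.

```python
def subtree_sizes(parent):
    """Number of nodes in each subtree; parent[root] == root."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = None
    for v, p in enumerate(parent):
        if p == v:
            root = v
        else:
            children[p].append(v)
    # Collect nodes in an order where every parent precedes its children.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * n
    # Accumulate sizes bottom-up by scanning that order in reverse.
    for v in reversed(order):
        if parent[v] != v:
            size[parent[v]] += size[v]
    return size


def swaq_naive(parent, weight, u, k):
    """Lowest ancestor of u (possibly u itself) with weight >= k, else None."""
    while weight[u] < k:
        if parent[u] == u:  # reached the root without success
            return None
        u = parent[u]
    return u
```

Because weights never increase when moving down the tree, the first node reached on the upward walk with weight at least `k` is the lowest such ancestor.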

arXiv.org e-Print Archive

Heaviest Induced Ancestors and Longest Common Substrings

Author: Gagie Travis
Gawrychowski Paweł
Nekrich Yakov
Publication date: 01/01/2013

Suppose we have two trees on the same set of leaves, in which nodes are weighted such that children are heavier than their parents. We say a node from the first tree and a node from the second tree are induced together if they have a common leaf descendant. In this paper we describe data structures that efficiently support the following heaviest-induced-ancestor query: given a node from the first tree and a node from the second tree, find an induced pair of their ancestors with maximum combined weight. Our solutions are based on a geometric interpretation that enables us to find heaviest induced ancestors using range queries. We then show how to use these results to build an LZ-compressed index with which we can quickly find, with high probability, a longest substring common to the indexed string and a given pattern.

arXiv.org e-Print Archive

CiteSeerX

MPG.PuRe

Random Access to Grammar Compressed Strings

Author: Bille Philip
Landau Gad M.
Raman Rajeev
Sadakane Kunihiko
Satti Srinivasa Rao
Weimann Oren
Publication date: 01/01/2011

Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot \alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars. Comment: Preliminary version in SODA 201
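As a rough illustration of random access without decompression, the following Python sketch descends a straight-line grammar using precomputed expansion lengths. It is only a baseline under stated assumptions: the grammar encoding (a list `rules` where each entry is a single character or a pair of earlier rule indices) and the function names are hypothetical, and the query time is proportional to the height of the grammar, not the $O(\log N)$ bound achieved in the paper.

```python
def expansion_lengths(rules):
    """Length of the string generated by each rule.

    rules[i] is either a single character (terminal) or a pair (l, r)
    of indices of earlier rules whose expansions are concatenated.
    """
    length = []
    for rule in rules:
        if isinstance(rule, str):
            length.append(1)
        else:
            l, r = rule
            length.append(length[l] + length[r])
    return length


def access(rules, length, symbol, i):
    """Character at 0-based position i of the expansion of rules[symbol]."""
    if not 0 <= i < length[symbol]:
        raise IndexError("position out of range")
    while not isinstance(rules[symbol], str):
        l, r = rules[symbol]
        if i < length[l]:
            symbol = l           # position falls in the left part
        else:
            i -= length[l]       # skip the left part and go right
            symbol = r
    return rules[symbol]
```

For example, with `rules = ["a", "b", (0, 1), (2, 0), (3, 2)]` the last rule generates `abaab`, and each character can be read by one root-to-leaf descent without expanding the whole string.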

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Leicester Research Archive

Speeding-up $q$-gram mining on grammar-based compressed texts

Author: Bannai Hideo
Goto Keisuke
Inenaga Shunsuke
Takeda Masayuki
坂内英夫
後藤啓介
稲永俊介
竹田正幸
Publication venue: Springer Science and Business Media LLC
Publication date: 15/02/2012

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.

arXiv.org e-Print Archive

Kyushu University Institutional Repository

Cross-Document Pattern Matching

Author: A. Andersson
J.L. Bentley
K. Sadakane
M. Farach
M.A. Bender
M.L. Fredman
O. Berkman
P. Bozanis
P. Dietz
R. Grossi
S. Muthukrishnan
T. Gagie
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

Author: Gawrychowski Pawel
Publication date: 01/01/2011

Countless variants of the Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(n log^2(N/n) + m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(n log(N/n) + m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan. Comment: submitted

arXiv.org e-Print Archive

CiteSeerX

Wavelet Trees Meet Suffix Trees

Author: Babenko Maxim
Gawrychowski Paweł
Kociumaka Tomasz
Starikovskaya Tatiana
Publication venue: Society for Industrial & Applied Mathematics (SIAM)
Publication date: 01/01/2015

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length $n$ over an alphabet of size $\sigma\leq n$, our method builds the wavelet tree in $O(n \log \sigma/ \sqrt{\log{n}})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of $n$ integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string $w$ of length $n$, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of $w$ while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings $x$ and $y$ of $w$ count the number of suffixes of $x$ that are lexicographically smaller than $y$, and for a substring $x$ of $w$ and an integer $k$, find the $k$-th lexicographically smallest suffix of $x$. We further show that wavelet suffix trees allow one to compute a run-length-encoded Burrows-Wheeler transform of a substring $x$ of $w$ in $O(s \log |x|)$ time, where $s$ denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression. Comment: 33 pages, 5 figures; preliminary version published at SODA 201
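For readers unfamiliar with the underlying structure, here is a small Python sketch of a textbook pointer-based wavelet tree over integers with a rank query. It is an illustrative assumption, not the construction from the paper: building takes $O(n \log \sigma)$ time, each node stores plain prefix counts instead of a succinct bitvector, and the class and method names are hypothetical.

```python
class WaveletTree:
    """Textbook wavelet tree over a non-empty integer sequence (sketch)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf: a single symbol value, or an empty node
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i symbols go to the left child.
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if c < self.lo or c > self.hi:
            return 0
        node = self
        while node.left is not None:
            mid = (node.lo + node.hi) // 2
            left_count = node.prefix[i]
            if c <= mid:
                node, i = node.left, left_count
            else:
                node, i = node.right, i - left_count
        return i
```

Each level halves the value range, so a rank query visits one node per level and maps the prefix length `i` into the corresponding child at every step.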

arXiv.org e-Print Archive

CiteSeerX

Crossref

MPG.PuRe