Wavelet Trees Meet Suffix Trees

Author: Babenko Maxim
Gawrychowski Paweł
Kociumaka Tomasz
Starikovskaya Tatiana
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2015

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length $n$ over an alphabet of size $\sigma \leq n$, our method builds the wavelet tree in $O(n \log \sigma / \sqrt{\log n})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of $n$ integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to a stringological context and propose a novel notion of wavelet suffix trees. For a string $w$ of length $n$, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of $w$ while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings $x$ and $y$ of $w$, count the number of suffixes of $x$ that are lexicographically smaller than $y$; and for a substring $x$ of $w$ and an integer $k$, find the $k$-th lexicographically smallest suffix of $x$. We further show that wavelet suffix trees make it possible to compute a run-length-encoded Burrows-Wheeler transform of a substring $x$ of $w$ in $O(s \log |x|)$ time, where $s$ denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.

Comment: 33 pages, 5 figures; preliminary version published at SODA 201
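To make the subrange rank/select queries concrete, here is a minimal sketch of a plain pointer-based wavelet tree. It is not the construction from the abstract (it builds in O(n log σ) time and answers queries in O(log σ) time), and the class and method names are illustrative assumptions.

```python
class WaveletTree:
    """Textbook pointer-based wavelet tree over integers (illustrative only)."""

    def __init__(self, values, lo=None, hi=None):
        values = list(values)
        # The root call must receive a non-empty sequence.
        self.lo = min(values) if lo is None else lo
        self.hi = max(values) if hi is None else hi
        self.left = self.right = None
        self.prefix = [0]  # prefix[i] = how many of values[:i] go to the left child
        if self.lo == self.hi or not values:
            return
        mid = (self.lo + self.hi) // 2
        for v in values:
            self.prefix.append(self.prefix[-1] + (v <= mid))
        self.left = WaveletTree([v for v in values if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in values if v > mid], mid + 1, self.hi)

    def quantile(self, i, j, k):
        """Select: the k-th smallest (0-based) element of values[i:j]."""
        if self.lo == self.hi:
            return self.lo
        li, lj = self.prefix[i], self.prefix[j]
        if k < lj - li:
            return self.left.quantile(li, lj, k)
        return self.right.quantile(i - li, j - lj, k - (lj - li))

    def count_leq(self, i, j, x):
        """Rank: the number of elements of values[i:j] that are <= x."""
        if i >= j or x < self.lo:
            return 0
        if x >= self.hi:
            return j - i
        mid = (self.lo + self.hi) // 2
        li, lj = self.prefix[i], self.prefix[j]
        if x <= mid:
            return self.left.count_leq(li, lj, x)
        return (lj - li) + self.right.count_leq(i - li, j - lj, x)
```

For example, with `wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])`, the call `wt.quantile(2, 7, 2)` selects the median of the subarray `[4, 1, 5, 9, 2]`.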

arXiv.org e-Print Archive

CiteSeerX

Crossref

MPG.PuRe

Compressed Representations of Permutations, and Applications

Author: Barbay Jérémy
Navarro Gonzalo
Publication date: 01/01/2008

We explore various techniques to compress a permutation $\pi$ over $n$ integers, taking advantage of ordered subsequences in $\pi$, while supporting its application $\pi(i)$ and the application of its inverse $\pi^{-1}(i)$ in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications $\pi^k(i)$ of it, of integer functions, and of inverted lists and suffix arrays.
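As a point of reference for the operations named above, the following sketch supports $\pi(i)$, $\pi^{-1}(i)$ and $\pi^k(i)$ in constant time through an explicit cycle decomposition. This is an uncompressed baseline using O(n) words, not one of the compression schemes of the abstract; the function names are assumptions.

```python
def build_cycle_index(perm):
    """Decompose a permutation of 0..n-1 into cycles.

    Returns (cycles, cycle_of, pos_in_cycle), where cycle_of[i] is the index
    of the cycle containing i and pos_in_cycle[i] is i's offset inside it.
    """
    n = len(perm)
    cycle_of = [-1] * n
    pos_in_cycle = [0] * n
    cycles = []
    for start in range(n):
        if cycle_of[start] != -1:
            continue  # already placed in an earlier cycle
        cycle = []
        i = start
        while cycle_of[i] == -1:
            cycle_of[i] = len(cycles)
            pos_in_cycle[i] = len(cycle)
            cycle.append(i)
            i = perm[i]
        cycles.append(cycle)
    return cycles, cycle_of, pos_in_cycle


def power(index, i, k):
    """Return pi^k(i); k may be negative, so k = -1 gives the inverse."""
    cycles, cycle_of, pos_in_cycle = index
    cycle = cycles[cycle_of[i]]
    return cycle[(pos_in_cycle[i] + k) % len(cycle)]
```

Moving `k` steps along the cycle that contains `i` is all that an iterated application needs, which is why the query cost does not depend on `k`.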

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

Dynamic Range Majority Data Structures

Author: A. Andersson
E.D. Demaine
J. Bentley
J. Misra
L. Arge
M. Fredman
P. Bozanis
P. Gupta
R. Karp
S. Durocher
T. Gagie
T. Husfeldt
Y. Lai
Publication date: 01/01/2011

Given a set $P$ of coloured points on the real line, we study the problem of answering range $\alpha$-majority (or "heavy hitter") queries on $P$. More specifically, for a query range $Q$, we want to return each colour that is assigned to more than an $\alpha$-fraction of the points contained in $Q$. We present a new data structure for answering range $\alpha$-majority queries on a dynamic set of points, where $\alpha \in (0,1)$. Our data structure uses $O(n)$ space, supports queries in $O((\lg n) / \alpha)$ time, and updates in $O((\lg n) / \alpha)$ amortized time. If the coordinates of the points are integers, then the query time can be improved to $O(\lg n / (\alpha \lg \lg n) + (\lg(1/\alpha))/\alpha)$. For constant values of $\alpha$, this improved query time matches an existing lower bound, for any data structure with polylogarithmic update time. We also generalize our data structure to handle sets of points in $d$ dimensions, for $d \ge 2$, as well as dynamic arrays, in which each entry is a colour.

Comment: 16 pages, Preliminary version appeared in ISAAC 201
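The query semantics can be pinned down with a brute-force sketch over a static array of colours. It scans the whole query range, so it costs time linear in the range length rather than the bounds quoted above; the function name and the half-open range convention are assumptions.

```python
from collections import Counter


def range_alpha_majority(colours, lo, hi, alpha):
    """Return every colour occurring in more than an alpha-fraction of colours[lo:hi].

    Brute-force baseline: counts every entry of the half-open range [lo, hi).
    """
    window = colours[lo:hi]
    threshold = alpha * len(window)
    counts = Counter(window)
    # The inequality is strict, matching "more than an alpha-fraction".
    return sorted(c for c, freq in counts.items() if freq > threshold)
```

Since each reported colour covers more than an $\alpha$-fraction of the range, a query can return fewer than $1/\alpha$ colours, which is consistent with the $1/\alpha$ factor in the stated query bounds.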

arXiv.org e-Print Archive

CiteSeerX

Crossref

Copenhagen University Research Information System

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 04/07/2019

Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time, $O(m + occ)$, within $O(r \log\log_w(\sigma + n/r))$ space, for a text of length $n$ over an alphabet of size $\sigma$ on a RAM machine with words of $w = \Omega(\log n)$ bits. Within that space, our index can also count in optimal time, $O(m)$. Multiplying the space by $O(w/\log\sigma)$, we support count and locate in $O(\lceil m \log(\sigma)/w \rceil)$ and $O(\lceil m \log(\sigma)/w \rceil + occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r \log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r) + \ell \log(\sigma)/w)$. Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in $O(\log(n/r))$ time per operation.

Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)
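To illustrate the measure $r$, the sketch below computes the BWT of a short text by sorting its rotations and then counts the runs of equal symbols. It is a quadratic-space toy for small inputs, not an index; the function names and the `$` sentinel (assumed to be absent from the text and smaller than every text symbol) are assumptions.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (toy sizes only)."""
    assert sentinel not in text, "sentinel must not occur in the text"
    s = text + sentinel
    # Sort rotation start positions by the rotation they produce.
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The last symbol of the rotation starting at i is s[i - 1].
    return "".join(s[i - 1] for i in order)


def count_runs(s):
    """Number of maximal runs of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])
```

For `"banana"` the transform is `"annb$aa"`, which has five runs; on highly repetitive inputs the number of runs grows much more slowly than the text length.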

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Optimal-Time Text Indexing in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 11/07/2017

Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...
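The counting operation mentioned above rests on backward search over the BWT. The sketch below shows that search on an uncompressed BWT with a linear-scan rank, so it illustrates the logic only and has none of the space or time guarantees of the Run-Length FM-index; all names and the `$` sentinel are assumptions.

```python
def bwt_of(text, sentinel="$"):
    """BWT by sorting rotations; the sentinel is assumed smaller than all symbols."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in order)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the text behind bwt via backward search."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a real index answers this without scanning.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol costs two rank queries, so the per-symbol cost of counting is determined by how fast rank can be answered on the chosen BWT representation.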

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Implementation and analysis of a Top-K retrieval system for strings

Author: Chandrasekaran Sabrina
Publication venue: LSU Digital Commons
Publication date: 01/01/2010

Given a text that is a union of d documents (strings), D = d1, d2, ..., dd, the emphasis of this thesis is to provide a practical framework to retrieve the K most relevant documents for a given pattern P, which comes as a query. This cannot be done directly, as going through every occurrence of the query pattern may prove to be expensive if the number of documents that the pattern occurs in is much larger than the number of documents (K) that we require. Some advanced query functionality will be required, as compared to listing the documents that the pattern occurs in, because a defined notion of "most relevant" must be provided. Therefore, an index needs to be built beforehand on the text T so that the documents can be retrieved very quickly. Traditionally, inverted indexes have proven to be effective in retrieving the Top-K documents. However, inverted indexes have certain disadvantages, which can be overcome by using other data structures like suffix trees and suffix arrays. A framework was originally provided by Muthukrishnan [29] that takes advantage of the number of relevant documents being less than the number of occurrences of the query pattern. He considered two metrics for relevance, frequency and proximity, and provided a framework that took O(n log n) space. Recently, Hon et al. [14] provided a framework that takes O(n) space to retrieve the Top-K documents with better query times, O(P + K log K), for arbitrary score functions. In this thesis we study the practicality of this index and provide added functionalities, based on the index, to retrieve Top-K documents for specific cases like phrase searching. We also provide functionality to output the K most relevant documents (according to page rank) when two patterns are given as queries.
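The problem statement can be illustrated with a naive baseline that scores each document by pattern frequency and keeps the K best. It scans every document at query time, which is exactly the cost an index is meant to avoid, and it is not the framework of Muthukrishnan or of Hon et al.; the function names and the tie-breaking rule are assumptions.

```python
import heapq


def count_overlapping(doc, pattern):
    """Number of (possibly overlapping) occurrences of pattern in doc."""
    count, start = 0, doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Ties are broken by the smaller document id; documents that do not
    contain the pattern are never reported.
    """
    hits = []
    for doc_id, doc in enumerate(documents):
        freq = count_overlapping(doc, pattern)
        if freq > 0:
            hits.append((freq, doc_id))
    best = heapq.nsmallest(k, hits, key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in best]
```

Frequency is only one of the relevance metrics mentioned above; a different score function could be swapped in by changing how `freq` is computed.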

CiteSeerX

Louisiana State University