Search CORE

959 research outputs found

Deterministic indexing for packed strings

Author: Bille Philip
Gørtz Inge Li
Skjoldjensen Frederik Rye
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 06/12/2016
Field of study

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In the deterministic variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the packed variant the strings are stored with several character in a single word, giving us the opportunity to read multiple characters simultaneously. Our main result is a new string index in the deterministic and packed setting. Given a packed string S of length n over an alphabet s, we show how to preprocess S in O(n) (deterministic) time and space O(n) such that given a packed pattern string of length m we can support queries in (deterministic) time O(m/a + log m + log log s), where a = w /log s is the number of characters packed in a word of size w = log n. Our query time is always at least as good as the previous best known bounds and whenever several characters are packed in a word, i.e., log s << w, the query times are faster

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Online Research Database In Technology

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Author: A
Alzamel Mai
Counting
Grossi Roberto
Hagerup Torben
Optimal
Uniqueness
Wavelet
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/05/2019
Field of study

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text

T

of length

n

, permutes its symbols according to the lexicographic order of suffixes of

T

. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length

n

, occupying

O(n/\log n)

machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in

O(n)

time and

O(n/\log n)

space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require

\Omega(n)

time. In this paper, we propose the first algorithm that breaks the

O(n)

-time barrier for BWT construction. Given a binary string of length

n

, our procedure builds the Burrows-Wheeler transform in

O(n/\sqrt{\log n})

time and

O(n/\log n)

space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art

O(m\sqrt{\log m})

-time solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size

O(n/\log n)

that answers Longest Common Extension queries (LCE queries) in

O(1)

time and, furthermore, can be deterministically constructed in the optimal

O(n/\log n)

time.Comment: Full version of a paper accepted to STOC 201

arXiv.org e-Print Archive

Crossref

Matching and Compression of Strings with Automata and Word Packing

Author: Skjoldjensen Frederik Rye
Publication venue: DTU Compute
Publication date: 01/01/2017
Field of study

Online Research Database In Technology

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication venue
Publication date: 04/07/2019
Field of study

Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n over an alphabet of size {\sigma} on a RAM machine with words of w = {\Omega}(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Optimal-Time Text Indexing in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication venue
Publication date: 11/07/2017
Field of study

Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is

r

, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used

O(r)

space and was able to efficiently count the number of occurrences of a pattern of length

m

in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of

r

. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the

occ

occurrences efficiently within

O(r)

space (in loglogarithmic time each), and reaching optimal time

O(m+occ)

within

O(r\log(n/r))

space, on a RAM machine of

w=\Omega(\log n)

bits. Within

O(r\log (n/r))

space, our index can also count in optimal time

O(m)

. Raising the space to

O(r w\log_\sigma(n/r))

, we support count and locate in

O(m\log(\sigma)/w)

and

O(m\log(\sigma)/w+occ)

time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using

O(r\log(n/r))

space that replaces the text and extracts any text substring of length

\ell

in almost-optimal time

O(\log(n/r)+\ell\log(\sigma)/w)

. (...continues...

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Author: Prezza Nicola
Publication venue
Publication date: 01/01/2020
Field of study

We consider the problem of encoding a string of length

n

from an integer alphabet of size

\sigma

so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take

n\log\sigma + \Theta(\log (n\log\sigma))

bits. We describe a new data structure matching this lower bound when

\sigma\leq n^{O(1)}

while supporting both queries in optimal

O(1)

time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of

\Theta(\log n)

bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in

O(n\log n)

time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in

O(n^{1.5}\sqrt{\log \sigma})

time. Running times of these Las Vegas algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

Author: Bannai Hideo
Inenaga Shunsuke
Kanda Shunsuke
Köppl Dominik
Nakashima Yuto
Takeda Masayuki
Tsuruta Kazuya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/10/2020
Field of study

Given a dynamic set

K

k

strings of total length

n

whose characters are drawn from an alphabet of size

\sigma

, a keyword dictionary is a data structure built on

K

that provides locate, prefix search, and update operations on

K

. Under the assumption that

\alpha = w / \lg \sigma

characters fit into a single machine word

w

, we propose a keyword dictionary that represents

K

n \lg \sigma + \Theta(k \lg n)

bits of space, supporting all operations in

O(m / \alpha + \lg \alpha)

expected time on an input string of length

m

in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations

arXiv.org e-Print Archive

Crossref

Full-fledged Real-Time Indexing for Constant Size Alphabets

Author: Kucherov Gregory
Nekrich Yakov
Publication venue
Publication date: 06/07/2013
Field of study

In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet ofconstant size. Each new symbol can be prepended to

T

in O(1) worst-case time. At any moment, we can report all occurrences of a pattern

P

in the current text in

O(|P|+k)

time, where

|P|

is the length of

P

and

k

is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08})

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM