Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Author: Prezza Nicola
Publication date: 01/01/2020

We consider the problem of encoding a string of length $n$ from an integer alphabet of size $\sigma$ so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take $n\log\sigma + \Theta(\log (n\log\sigma))$ bits. We describe a new data structure matching this lower bound when $\sigma\leq n^{O(1)}$ while supporting both queries in optimal $O(1)$ time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of $\Theta(\log n)$ bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in $O(n\log n)$ time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in $O(n^{1.5}\sqrt{\log \sigma})$ time. Running times of these Las Vegas algorithms hold in the worst case with high probability.

Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm
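To make the two query types in this abstract concrete, here is a minimal sketch of access and Monte Carlo substring-equality queries using Karp-Rabin prefix fingerprints. It is not the data structure of the paper: it stores O(n) extra words rather than working in place, and the class name `SubstringEquality`, the modulus, and the random base are assumptions chosen for illustration. Query ranges are assumed to lie inside the text.

```python
import random


class SubstringEquality:
    """Access and Monte Carlo substring-equality queries (illustrative only)."""

    MOD = (1 << 61) - 1  # Mersenne prime modulus; an assumption of this sketch

    def __init__(self, text, seed=None):
        rng = random.Random(seed)
        self.text = text
        self.base = rng.randrange(256, self.MOD - 1)  # random evaluation point
        n = len(text)
        self.prefix = [0] * (n + 1)  # prefix[i] = fingerprint of text[:i]
        self.power = [1] * (n + 1)   # power[i] = base**i mod MOD
        for i, ch in enumerate(text):
            self.prefix[i + 1] = (self.prefix[i] * self.base + ord(ch) + 1) % self.MOD
            self.power[i + 1] = (self.power[i] * self.base) % self.MOD

    def access(self, i):
        """Return the i-th character of the text."""
        return self.text[i]

    def _fingerprint(self, i, length):
        """Fingerprint of text[i:i+length], computed from two prefix values."""
        return (self.prefix[i + length] - self.prefix[i] * self.power[length]) % self.MOD

    def equal(self, i, j, length):
        """True if text[i:i+length] == text[j:j+length], up to a small
        probability of a false positive caused by a fingerprint collision."""
        return self._fingerprint(i, length) == self._fingerprint(j, length)
```

Both queries take constant time after linear preprocessing; equal substrings always compare equal, while unequal substrings are reported equal only if the random base happens to be a root of their difference polynomial.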

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Author: Ayad Lorraine A.K.
Loukides Grigorios
Pissis Solon P.
Verbeek Hilde
Publication date: 15/12/2023

Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time, in the worst case, or in O(n) time, when the total number of suffixes with an LCP value greater than 2^{⌊log(n/b)⌋+1} − 1 is in O(b/ log b), matching the time of optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. We also show that our second algorithm can be trivially amended to work in O(n) time for any uniformly random string. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014].
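The problem statement can be illustrated with a naive baseline that sorts the chosen suffixes by direct comparison and then computes the LCP of lexicographic neighbours. This sketch is not one of the algorithms described in the abstract and can take quadratic time; the function names are assumptions made for illustration.

```python
def sparse_suffix_array(text, positions):
    """Sort the given suffix start positions by the suffixes they denote."""
    return sorted(positions, key=lambda p: text[p:])


def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_lcp_array(text, ssa):
    """LCP of each suffix with its predecessor in sorted order (first entry 0)."""
    if not ssa:
        return []
    return [0] + [lcp(text, ssa[r - 1], ssa[r]) for r in range(1, len(ssa))]


ssa = sparse_suffix_array("banana", [1, 3, 5])
print(ssa)                              # [5, 3, 1]
print(sparse_lcp_array("banana", ssa))  # [0, 1, 3]
```

For "banana" with positions 1, 3 and 5, the suffixes "a", "ana" and "anana" sort in that order, giving the arrays shown in the comments.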

King's Research Portal

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Author: Ayad Lorraine A. K.
Loukides Grigorios
Pissis Solon P.
Verbeek Hilde
Publication date: 13/10/2023

Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.

Comment: 16 pages, 1 figure

arXiv.org e-Print Archive

Wavelet Trees Meet Suffix Trees

Author: Babenko Maxim
Gawrychowski Paweł
Kociumaka Tomasz
Starikovskaya Tatiana
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2015

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length $n$ over an alphabet of size $\sigma\leq n$, our method builds the wavelet tree in $O(n \log \sigma/ \sqrt{\log{n}})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of $n$ integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to a stringological context and propose a novel notion of wavelet suffix trees. For a string $w$ of length $n$, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of $w$ while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings $x$ and $y$ of $w$, count the number of suffixes of $x$ that are lexicographically smaller than $y$; and for a substring $x$ of $w$ and an integer $k$, find the $k$-th lexicographically smallest suffix of $x$. We further show that wavelet suffix trees allow computing a run-length-encoded Burrows-Wheeler transform of a substring $x$ of $w$ in $O(s \log |x|)$ time, where $s$ denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.

Comment: 33 pages, 5 figures; preliminary version published at SODA 201
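As background for the rank queries mentioned in this abstract, the following is a minimal sketch of a textbook pointer-based wavelet tree over integers. It is built level by level in O(n log σ) time and does not implement the faster construction or the wavelet suffix tree described above; the class name `WaveletTree` and its layout are assumptions made for illustration, and the input sequence is assumed to be non-empty.

```python
class WaveletTree:
    """Textbook wavelet tree over a non-empty integer sequence (illustrative)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # zeros[i] = how many of the first i elements go to the left child
        self.zeros = [0]
        if lo == hi or not seq:
            return  # leaf, or an empty node that needs no children
        mid = (lo + hi) // 2
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of value among the first i elements."""
        if value < self.lo or value > self.hi:
            return 0
        if self.lo == self.hi:
            return i  # every element stored at this leaf equals value
        if self.left is None:
            return 0  # empty node: nothing to count
        mid = (self.lo + self.hi) // 2
        z = self.zeros[i]
        if value <= mid:
            return self.left.rank(value, z)
        return self.right.rank(value, i - z)


wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
print(wt.rank(1, 4))  # 2
```

Each rank query follows one root-to-leaf path, so it touches at most one node per level of the tree.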

arXiv.org e-Print Archive

CiteSeerX

Crossref

MPG.PuRe

Deterministic sub-linear space LCE data structures with efficient construction

Author: Bannai Hideo
I Tomohiro
Inenaga Shunsuke
Puglisi Simon J.
Takeda Masayuki
Tanimura Yuka
Publication date: 01/01/2016

Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq \tau \leq n$, their best deterministic solution is a data structure of size $O(n/\tau)$ which allows LCE queries to be answered in $O(\tau)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(\tau\min\{\log\tau,\log\frac{n}{\tau}\})$ query time using $O(n/\tau)$ space, but significantly improve the construction time to $O(n\tau)$.

Comment: updated title

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

Author: Ishikawa Yoshiharu
Koide Satoshi
Tadokoro Yukihiro
Xiao Chuan
Publication date: 29/09/2017

In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with theoretical guarantees. Specifically, our method is based on FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
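For readers unfamiliar with the FM-index this abstract builds on, here is a minimal sketch of pattern counting by backward search over the Burrows-Wheeler transform. It is a plain, uncompressed illustration over character strings, not CiNCT: relative movement labeling and PseudoRank are not modelled, rank is computed by scanning, and the function names and the `"\0"` sentinel (assumed absent from the text) are choices made for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, built by naive suffix sorting."""
    text = text + "\0"  # sentinel, assumed not to occur in the input
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The character preceding each sorted suffix (wraps around for suffix 0).
    return "".join(text[i - 1] for i in order)


def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text via backward search."""
    last = bwt(text)
    # first_row[c] = number of characters in the text strictly smaller than c
    first_row = {}
    total = 0
    for c in sorted(set(last)):
        first_row[c] = total
        total += last.count(c)
    lo, hi = 0, len(last)  # current range of matching sorted suffixes
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + last[:lo].count(c)  # naive rank; a real index precomputes this
        hi = first_row[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


print(count_occurrences("banana", "ana"))  # 2
```

The search processes the pattern from its last character to its first, narrowing a range of sorted suffixes at each step; a practical FM-index replaces the scanning `count` calls with a compact rank structure.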

arXiv.org e-Print Archive

Crossref