
HAL-Ecole des Ponts ParisTech

Hal-Diderot

HAL - UPEC / UPEM

Sparse Text Indexing in Small Space

Author: Bille Philip
Fischer Johannes
Gørtz Inge Li
Kopelowitz Tsvi
Sach Benjamin
Vildhøj Hjalte Wedel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

In this work we present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays and sparse position heaps for b arbitrary positions of a text T of length n while using only O(b) words of space during the construction. Attempts at breaking the naive bound of Ω(nb) time for constructing sparse suffix trees in O(b) space can be traced back to the origins of string indexing in 1968. The first results were obtained only in 1996, and only for the case where the b suffixes are evenly spaced in T. In this paper there is no constraint on the locations of the suffixes. Our main contribution is to show that the sparse suffix tree (and array) can be constructed in O(n log^2 b) time. To achieve this we develop a technique that efficiently answers b longest common prefix queries on suffixes of T using only O(b) space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Our first solution is Monte-Carlo and outputs the correct tree with high probability. We then give a Las-Vegas algorithm which also uses O(b) space and runs in the same time bounds with high probability when b = O(√n). Furthermore, additional tradeoffs between the space usage and the construction time for the Monte-Carlo algorithm are given. Finally, we show that, at the expense of slower pattern queries, it is possible to construct sparse position heaps in O(n + b log b) time and O(b) space.
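To make the objects in this abstract concrete, here is a minimal Python sketch of the naive baseline it improves on: sorting the b chosen suffixes by direct character comparison, keeping only O(b) indices in memory. The function names (`lcp`, `sparse_suffix_array`, `sparse_lcp_array`) are hypothetical, and this is not the Monte-Carlo or Las-Vegas algorithm of the paper; each comparison here may scan up to n characters.

```python
from functools import cmp_to_key


def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_suffix_array(text, positions):
    """Sort the given suffix start positions lexicographically (naive baseline)."""
    n = len(text)

    def compare(i, j):
        if i == j:
            return 0
        k = lcp(text, i, j)
        if i + k == n:      # suffix i is a proper prefix of suffix j
            return -1
        if j + k == n:      # suffix j is a proper prefix of suffix i
            return 1
        return -1 if text[i + k] < text[j + k] else 1

    # Only the b positions are stored; suffixes are never copied.
    return sorted(positions, key=cmp_to_key(compare))


def sparse_lcp_array(text, ssa):
    """LCP values between lexicographically adjacent sampled suffixes."""
    return [lcp(text, ssa[r - 1], ssa[r]) for r in range(1, len(ssa))]


if __name__ == "__main__":
    ssa = sparse_suffix_array("banana", [0, 2, 4])
    print(ssa, sparse_lcp_array("banana", ssa))
```

The sparse suffix array together with the adjacent LCP values is enough to build the sparse suffix tree, which is why fast LCP queries are the bottleneck the abstract focuses on.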

Online Research Database In Technology

Computing Lempel-Ziv Factorization Online

Author: Starikovskaya Tatiana
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We present an algorithm which computes the Lempel-Ziv factorization of a word $W$ of length $n$ on an alphabet $\Sigma$ of size $\sigma$ online in the following sense: it reads $W$ starting from the left, and, after reading each $r = O(\log_{\sigma} n)$ characters of $W$, updates the Lempel-Ziv factorization. The algorithm requires $O(n \log \sigma)$ bits of space and $O(n \log^2 n)$ time. The basis of the algorithm is a sparse suffix tree combined with wavelet trees.
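For readers unfamiliar with the factorization itself, the following Python sketch computes it by brute force. It assumes one common definition (each factor is either a fresh character or the longest prefix of the remaining text that also starts at an earlier position, overlaps allowed); the abstract does not say which variant the paper uses, and the function name `lz_factorize` is hypothetical. It is quadratic or worse and is unrelated to the paper's space-efficient online method.

```python
def lz_factorize(word):
    """Brute-force Lempel-Ziv factorization (assumed self-referential variant)."""
    factors = []
    i, n = 0, len(word)
    while i < n:
        best_len = 0
        # Try every earlier starting position j < i as the source of a copy.
        for j in range(i):
            k = 0
            # The copy may overlap the factor being built (j + k can reach past i).
            while i + k < n and word[j + k] == word[i + k]:
                k += 1
            best_len = max(best_len, k)
        # A character with no earlier occurrence forms a factor of length 1.
        length = max(best_len, 1)
        factors.append(word[i:i + length])
        i += length
    return factors


if __name__ == "__main__":
    print(lz_factorize("abababab"))
```

Concatenating the returned factors always reproduces the input word.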

Wavelet Trees Meet Suffix Trees

Author: Babenko Maxim
Gawrychowski Paweł
Kociumaka Tomasz
Starikovskaya Tatiana
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2015

We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length $n$ over an alphabet of size $\sigma \leq n$, our method builds the wavelet tree in $O(n \log \sigma / \sqrt{\log n})$ time, improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$. As a consequence, given an array of $n$ integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and capable of answering rank/select queries for the subranges of the array in $O(\log n / \log \log n)$ time. This is a $\log \log n$-factor improvement in query time compared to Chan and Pătraşcu and a $\sqrt{\log n}$-factor improvement in construction time compared to Brodal et al. Next, we switch to the stringological context and propose a novel notion of wavelet suffix trees. For a string $w$ of length $n$, this data structure occupies $O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously captures the combinatorial structure of substrings of $w$ while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in $O(\log |x|)$ time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings $x$ and $y$ of $w$, count the number of suffixes of $x$ that are lexicographically smaller than $y$; and for a substring $x$ of $w$ and an integer $k$, find the $k$-th lexicographically smallest suffix of $x$. We further show that wavelet suffix trees allow computing a run-length-encoded Burrows-Wheeler transform of a substring $x$ of $w$ in $O(s \log |x|)$ time, where $s$ denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.

Comment: 33 pages, 5 figures; preliminary version published at SODA 201
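The two suffix queries defined in this abstract can be pinned down with a brute-force reference in Python. This sketch only fixes the semantics of the queries; it materializes and compares all suffixes of x, so it has nothing to do with the wavelet suffix tree or its stated query time. The function names are hypothetical.

```python
def suffixes(x):
    """All non-empty suffixes of the string x."""
    return [x[i:] for i in range(len(x))]


def suffix_rank(x, y):
    """Number of suffixes of x that are lexicographically smaller than y."""
    # Python compares strings lexicographically; a proper prefix sorts first.
    return sum(1 for s in suffixes(x) if s < y)


def suffix_select(x, k):
    """The k-th lexicographically smallest suffix of x (k is 1-based)."""
    return sorted(suffixes(x))[k - 1]


if __name__ == "__main__":
    w = "abracadabra_banana"
    x = w[12:18]            # the substring "banana" of w
    print(suffix_rank(x, "b"), suffix_select(x, 4))
```

A reference like this is mainly useful for checking an efficient implementation against small inputs.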

MPG.PuRe

Linear-Space Data Structures for Range Mode Query in Arrays

Author: Durocher Stephane
Morrison Jason
Publication date: 01/01/2011

A mode of a multiset $S$ is an element $a \in S$ of maximum multiplicity; that is, $a$ occurs at least as frequently as any other element in $S$. Given a list $A[1:n]$ of $n$ items, we consider the problem of constructing a data structure that efficiently answers range mode queries on $A$. Each query consists of an input pair of indices $(i, j)$ for which a mode of $A[i:j]$ must be returned. We present an $O(n^{2-2\epsilon})$-space static data structure that supports range mode queries in $O(n^\epsilon)$ time in the worst case, for any fixed $\epsilon \in [0,1/2]$. When $\epsilon = 1/2$, this corresponds to the first linear-space data structure to guarantee $O(\sqrt{n})$ query time. We then describe three additional linear-space data structures that provide $O(k)$, $O(m)$, and $O(|j-i|)$ query time, respectively, where $k$ denotes the number of distinct elements in $A$ and $m$ denotes the frequency of the mode of $A$. Finally, we examine generalizing our data structures to higher dimensions.

Comment: 13 pages, 2 figures
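The flavor of a linear-space, roughly square-root-time range mode structure can be sketched in Python with a textbook block decomposition. This is a simplified variant, not the data structure of the paper: it stores a table of modes for every span of whole blocks (about n entries when the block size is near √n) and verifies candidates by binary search over occurrence lists, which adds a logarithmic factor to the query time. The class name `RangeMode` is hypothetical, and indices are 0-based and inclusive, unlike the 1-based $A[i:j]$ of the abstract.

```python
import bisect
import math
from collections import defaultdict


class RangeMode:
    def __init__(self, a):
        self.a = list(a)
        n = len(self.a)
        self.block = max(1, math.isqrt(n))
        b = self.block
        # positions[v] = sorted list of indices where v occurs
        self.positions = defaultdict(list)
        for idx, v in enumerate(self.a):
            self.positions[v].append(idx)
        nb = (n + b - 1) // b
        # mode_table[s][e] = a mode of blocks s..e (inclusive), for s <= e
        self.mode_table = [[None] * nb for _ in range(nb)]
        for s in range(nb):
            counts = defaultdict(int)
            best, best_count = None, 0
            for e in range(s, nb):
                for idx in range(e * b, min(n, (e + 1) * b)):
                    v = self.a[idx]
                    counts[v] += 1
                    if counts[v] > best_count:
                        best, best_count = v, counts[v]
                self.mode_table[s][e] = best

    def _count(self, v, i, j):
        """Occurrences of v in a[i..j], by binary search on its position list."""
        p = self.positions[v]
        return bisect.bisect_right(p, j) - bisect.bisect_left(p, i)

    def query(self, i, j):
        """Return a mode of a[i..j] (0-based, inclusive)."""
        b = self.block
        first_full = (i + b - 1) // b      # first block lying entirely in [i, j]
        last_full = (j + 1) // b - 1       # last block lying entirely in [i, j]
        candidates = set()
        if first_full <= last_full:
            candidates.add(self.mode_table[first_full][last_full])
            loose = list(range(i, first_full * b))
            loose += list(range((last_full + 1) * b, j + 1))
        else:
            loose = range(i, j + 1)
        # The mode is either the mode of the whole blocks or occurs in a loose end.
        candidates.update(self.a[t] for t in loose)
        return max(candidates, key=lambda v: self._count(v, i, j))


if __name__ == "__main__":
    rm = RangeMode([1, 2, 2, 3, 1, 1, 2, 3, 3, 3, 1])
    print(rm.query(0, 10), rm.query(1, 3))
```

When several elements tie for the maximum frequency, `query` returns one of them, matching the abstract's wording "a mode".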

CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

Author: Ishikawa Yoshiharu
Koide Satoshi
Tadokoro Yukihiro
Xiao Chuan
Publication date: 29/09/2017

In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with theoretical guarantees. Specifically, our method is based on the FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
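As background for the FM-index that this abstract builds on, here is a minimal Python sketch of FM-index pattern counting (backward search) over a sequence of road-edge IDs. It is a plain, uncompressed textbook version with naively built tables; it does not implement relative movement labeling or PseudoRank, whose details the abstract does not give. The function names are hypothetical, and edge IDs are assumed to be non-negative integers so that -1 can serve as the end-of-sequence sentinel.

```python
def build_fm_index(seq):
    """Build C and Occ tables for seq (assumed: non-negative integer edge IDs)."""
    s = list(seq) + [-1]                       # -1 is a unique, smallest sentinel
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = [s[i - 1] for i in sa]
    symbols = sorted(set(s))
    # c_table[c] = number of symbols in s strictly smaller than c
    c_table, total = {}, 0
    for c in symbols:
        c_table[c] = total
        total += s.count(c)
    # occ[c][k] = number of occurrences of c in bwt[:k]
    occ = {c: [0] for c in symbols}
    for ch in bwt:
        for c in symbols:
            occ[c].append(occ[c][-1] + (1 if ch == c else 0))
    return c_table, occ, len(s)


def count_occurrences(index, pattern):
    """Count occurrences of pattern by backward search over the BWT."""
    c_table, occ, n = index
    lo, hi = 0, n                              # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in c_table:
            return 0
        lo = c_table[c] + occ[c][lo]
        hi = c_table[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    trajectory = [1, 2, 3, 1, 2, 4, 1, 2, 3]   # hypothetical edge IDs
    index = build_fm_index(trajectory)
    print(count_occurrences(index, [1, 2]))
```

A practical FM-index replaces the dense `occ` table with a compressed rank structure such as a wavelet tree; the dense table here is only for readability.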