The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

Authors: Grossi Roberto; Ottaviano Giuseppe
Publication date: 01/01/2012

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequences of strings, representing indexed sequences of strings in nearly optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e., the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
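The query interface described in this abstract can be illustrated with a naive, uncompressed baseline. The sketch below is not the Wavelet Trie: it stores the strings in a plain list and answers every query by scanning, so it only shows what the operations are expected to return. The class name `NaiveStringSequence` and the method names `access`, `rank`, `select`, `rank_prefix`, and `append` are assumptions chosen for illustration, as is the 0-based indexing.

```python
class NaiveStringSequence:
    """Uncompressed reference for the query semantics (hypothetical names)."""

    def __init__(self, strings=()):
        self.strings = list(strings)

    def access(self, pos):
        """Random access: the string stored at position pos."""
        return self.strings[pos]

    def rank(self, s, pos):
        """Number of exact occurrences of s among the first pos strings."""
        return sum(1 for x in self.strings[:pos] if x == s)

    def select(self, s, idx):
        """Position of the idx-th occurrence of s (0-based), or -1 if absent."""
        seen = 0
        for i, x in enumerate(self.strings):
            if x == s:
                if seen == idx:
                    return i
                seen += 1
        return -1

    def rank_prefix(self, prefix, pos):
        """Number of strings among the first pos that start with prefix."""
        return sum(1 for x in self.strings[:pos] if x.startswith(prefix))

    def append(self, s):
        """Dynamic setting: a previously unseen string may be added at any time."""
        self.strings.append(s)
```

Every query here costs time linear in the sequence length and the space is uncompressed; the abstract's contribution is to support the same kind of queries in space close to the entropy of the sequence.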

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Rank, select and access in grammar-compressed strings

Authors: Belazzougui Djamal; Puglisi Simon J.; Tabei Yasuo
Publication date: 14/08/2014

Given a string $S$ of length $N$ on a fixed alphabet of $\sigma$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mathrm{rank}_c(S,i)$ (return the number of occurrences of symbol $c$ before position $i$ in $S$); $\mathrm{select}_c(S,i)$ (return the position of the $i$th occurrence of $c$ in $S$); and $\mathrm{access}(S,i,j)$ (return substring $S[i,j]$). For rank and select we describe data structures of size $O(n\sigma\log N)$ bits that support the two operations in $O(\log N)$ time. We propose another structure that uses $O(n\sigma\log (N/n)(\log N)^{1+\epsilon})$ bits and that supports the two queries in $O(\log N/\log\log N)$, where $\epsilon>0$ is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires $O(n\log N)$ bits of space and $O(\log N+m/\log_\sigma N)$ time to extract $m=j-i+1$ consecutive symbols from $S$. Alternatively, we can achieve $O(\log N/\log\log N+m/\log_\sigma N)$ query time using $O(n\log (N/n)(\log N)^{1+\epsilon})$ bits of space. This matches a lower bound stated by Verbin and Yu for strings where $N$ is polynomially related to $n$.

Comment: 16 pages
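To make the access and rank operations concrete, here is a minimal sketch on a straight-line program, a grammar in which every rule is either a single terminal or a pair of rule names. It stores, for each rule, the length of its expansion and a per-symbol count, then answers queries by descending from the start rule without expanding the string. This is a textbook-style baseline, not the paper's data structures: query time is proportional to the grammar's height rather than the bounds quoted above. The class name `SLP`, the rule encoding, and the 0-based positions are assumptions made for this example.

```python
from collections import Counter


class SLP:
    """Straight-line program; rules map a name to a 1-char terminal or a (left, right) pair."""

    def __init__(self, rules, start):
        self.rules = rules
        self.start = start
        self.length = {}   # expansion length of each rule
        self.counts = {}   # per-symbol occurrence counts of each rule
        for name in rules:
            self._measure(name)

    def _measure(self, name):
        if name in self.length:
            return
        rhs = self.rules[name]
        if isinstance(rhs, str):            # terminal rule
            self.length[name] = 1
            self.counts[name] = Counter(rhs)
        else:                               # binary rule
            left, right = rhs
            self._measure(left)
            self._measure(right)
            self.length[name] = self.length[left] + self.length[right]
            self.counts[name] = self.counts[left] + self.counts[right]

    def access(self, i):
        """Return the symbol S[i] (0-based) by walking down the parse tree."""
        name = self.start
        while not isinstance(self.rules[name], str):
            left, right = self.rules[name]
            if i < self.length[left]:
                name = left
            else:
                i -= self.length[left]
                name = right
        return self.rules[name]

    def rank(self, c, i):
        """Return the number of occurrences of c in the prefix S[0:i]."""
        name, total = self.start, 0
        while i > 0:
            rhs = self.rules[name]
            if isinstance(rhs, str):        # i == 1 here: the prefix is this symbol
                total += int(rhs == c)
                break
            left, right = rhs
            if i < self.length[left]:
                name = left
            else:                           # the whole left child lies in the prefix
                total += self.counts[left][c]
                i -= self.length[left]
                name = right
        return total


# Grammar for "abababab": E -> DD, D -> CC, C -> AB, A -> a, B -> b
grammar = SLP({"A": "a", "B": "b", "C": ("A", "B"),
               "D": ("C", "C"), "E": ("D", "D")}, "E")
```

Storing one count per rule and symbol uses on the order of $n\sigma$ counters of $\log N$ bits each, which is the same order as the first space bound in the abstract; the difficulty the paper addresses is answering queries quickly regardless of the grammar's shape.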

arXiv.org e-Print Archive

CiteSeerX

CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

Authors: Ishikawa Yoshiharu; Koide Satoshi; Tadokoro Yukihiro; Xiao Chuan
Publication date: 29/09/2017

In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with theoretical guarantees. Specifically, our method is based on the FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
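The FM-index pattern matching that this abstract builds on can be sketched on a trajectory given as a list of integer edge IDs. The example below is a plain FM-index count query (Burrows-Wheeler transform plus backward search) with naive construction and naive rank counting. It does not implement relative movement labeling or PseudoRank, and the function names `build_bwt` and `count_occurrences`, the non-negative integer edge IDs, and the `-1` sentinel are assumptions made for illustration.

```python
def build_bwt(edges):
    """Burrows-Wheeler transform of an edge sequence (naive rotation sort)."""
    text = list(edges) + [-1]               # -1: sentinel smaller than any edge ID
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return [rot[-1] for rot in rotations]


def count_occurrences(bwt, pattern):
    """Count how often pattern occurs in the original sequence via backward search."""
    # first[c] = number of symbols in the text that are strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)

    lo, hi = 0, len(bwt)                    # current range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt[:lo].count(c)   # naive rank; real indexes do this in O(1) or O(log sigma)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


trajectory = [1, 2, 3, 1, 2, 4, 1, 2, 3]    # hypothetical road-edge IDs
bwt = build_bwt(trajectory)
```

The rank computation inside the loop is the step a practical index accelerates; here it is a linear scan so that the backward-search logic stays visible.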

arXiv.org e-Print Archive

Crossref