Linear pattern matching on sparse suffix trees
Packing several characters into one computer word is a simple and natural way
to compress the representation of a string and to speed up its processing.
Exploiting this idea, we propose an index for a packed string, based on a {\em
sparse suffix tree} \cite{KU-96} with appropriately defined suffix links.
Assuming, under the standard unit-cost RAM model, that a word can store up to
characters ( the alphabet size), our index takes
space, i.e. the same space as the packed string itself.
The resulting pattern matching algorithm runs in time ,
where is the length of the pattern, is the actual number of characters
stored in a word and is the number of pattern occurrences.
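The packing idea in this abstract can be illustrated with a minimal sketch. It is not the paper's index: it only shows how several characters fit into one integer "word". The function names `pack` and `unpack`, the fixed `chars_per_word` parameter, and the DNA alphabet are assumptions chosen for the illustration.

```python
def bits_per_char(alphabet):
    """Number of bits needed to encode one character of `alphabet`."""
    return max(1, (len(alphabet) - 1).bit_length())


def pack(text, alphabet, chars_per_word):
    """Pack `text` into a list of integers, `chars_per_word` characters each.

    The last word may hold fewer characters, so the caller must remember
    the original length to unpack it (hypothetical convention).
    """
    bits = bits_per_char(alphabet)
    code = {c: i for i, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(text), chars_per_word):
        word = 0
        for c in text[start:start + chars_per_word]:
            word = (word << bits) | code[c]
        words.append(word)
    return words


def unpack(words, alphabet, chars_per_word, length):
    """Inverse of `pack`; `length` is the length of the original text."""
    bits = bits_per_char(alphabet)
    mask = (1 << bits) - 1
    out = []
    for i, word in enumerate(words):
        count = min(chars_per_word, length - i * chars_per_word)
        for j in range(count):
            shift = bits * (count - 1 - j)
            out.append(alphabet[(word >> shift) & mask])
    return "".join(out)


packed = pack("ACGTAC", "ACGT", 4)
restored = unpack(packed, "ACGT", 4, 6)
```

Comparing two full packed words with a single integer comparison checks `chars_per_word` characters at once, which is the kind of speed-up the abstract alludes to.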
CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling
In this paper, we present a compressed data structure for moving object
trajectories in a road network, which are represented as sequences of road
edges. Unlike existing compression methods for trajectories in a network, our
method supports pattern matching and decompression from an arbitrary position
while retaining a high compressibility with theoretical guarantees.
Specifically, our method is based on the FM-index, a fast and compact data
structure for pattern matching. To enhance the compression, we incorporate the
sparsity of road networks into the data structure. In particular, we present
the novel concepts of relative movement labeling and PseudoRank, each
contributing to significant reductions in data size and query processing time.
Our theoretical analysis and experimental studies reveal the advantages of our
proposed method as compared to existing trajectory compression methods and
FM-index variants.
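For readers unfamiliar with the FM-index mentioned above, the following is a minimal sketch of its core operation, backward search over the Burrows-Wheeler transform. It works on a plain character string and does not implement the paper's relative movement labeling or PseudoRank. The helper names, the `"\0"` sentinel, and the naive linear-time `rank` are assumptions made to keep the example short.

```python
def build_bwt(text):
    """Return the BWT of `text` with a "\\0" sentinel appended (naive construction)."""
    text = text + "\0"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] with i == 0 wraps around to the sentinel, as required.
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(bwt, pattern):
    """Count occurrences of `pattern` in the indexed text via backward search."""
    # first[c] = number of characters in the text strictly smaller than c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a real FM-index answers this in
        # (near-)constant time with a compact rank structure.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("abracadabra")
```

With `bwt` built as above, `count_occurrences(bwt, "abra")` processes the pattern right to left, narrowing the suffix-array interval one character at a time.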
Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its
applications to a number of rank/select problems for integer keys and strings.
Given a string of length n over an alphabet of size , our
method builds the wavelet tree in time,
improving upon the state-of-the-art algorithm by a factor of .
As a consequence, given an array of n integers we can construct in time a data structure consisting of machine words and
capable of answering rank/select queries for the subranges of the array in
time. This is a -factor improvement in
query time compared to Chan and P\u{a}tra\c{s}cu and a -factor
improvement in construction time compared to Brodal et al.
Next, we switch to stringological context and propose a novel notion of
wavelet suffix trees. For a string w of length n, this data structure occupies
words, takes time to construct, and simultaneously
captures the combinatorial structure of substrings of w while enabling
efficient top-down traversal and binary search. In particular, with a wavelet
suffix tree we are able to answer in time the following two
natural analogues of rank/select queries for suffixes of substrings: for
substrings x and y of w count the number of suffixes of x that are
lexicographically smaller than y, and for a substring x of w and an integer k,
find the k-th lexicographically smallest suffix of x.
We further show that wavelet suffix trees make it possible to compute a
run-length-encoded Burrows-Wheeler transform of a substring x of w in time, where s denotes the length of the resulting run-length encoding.
This answers a question by Cormode and Muthukrishnan, who considered an
analogous problem for Lempel-Ziv compression. Comment: 33 pages, 5 figures; preliminary version published at SODA 201
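The two suffix queries defined in this abstract can be pinned down with a naive reference implementation. This sketch only fixes the semantics of the queries; it has nothing to do with the wavelet suffix tree itself and runs in far worse time. The function names are hypothetical, and `k` is assumed to be 1-based.

```python
def count_smaller_suffixes(x, y):
    """Number of non-empty suffixes of x that are lexicographically smaller than y."""
    return sum(1 for i in range(len(x)) if x[i:] < y)


def kth_smallest_suffix(x, k):
    """The k-th (1-based) lexicographically smallest non-empty suffix of x."""
    suffixes = sorted(x[i:] for i in range(len(x)))
    return suffixes[k - 1]


# In the abstract, x and y are substrings of a longer string w; here they
# are passed directly as Python strings.
w = "bananas"
x = w[0:6]  # "banana"
smaller_than_b = count_smaller_suffixes(x, "b")
second = kth_smallest_suffix(x, 2)
```

A reference like this is mainly useful as a test oracle when checking a faster implementation on small inputs.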
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
Computing Lempel-Ziv Factorization Online
We present an algorithm which computes the Lempel-Ziv factorization of a word
of length on an alphabet of size online in the
following sense: it reads starting from the left, and, after reading each
characters of , updates the Lempel-Ziv
factorization. The algorithm requires bits of space and O(n
\log^2 n) time. The basis of the algorithm is a sparse suffix tree combined
with wavelet trees.
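As a point of reference for what is being computed, here is a naive offline sketch of a Lempel-Ziv factorization. It is not the online algorithm of the abstract and uses neither sparse suffix trees nor wavelet trees. It assumes the common variant in which each factor is either a fresh character or the longest prefix of the remaining text that also starts at an earlier position, with self-overlap allowed; the abstract does not state which variant is used.

```python
def lz_factorize(w):
    """Return the list of Lempel-Ziv factors of w (naive, quadratic or worse)."""
    factors = []
    n = len(w)
    i = 0
    while i < n:
        best = 0
        # Try every earlier starting position j and extend the match greedily.
        for j in range(i):
            length = 0
            while i + length < n and w[j + length] == w[i + length]:
                length += 1
            best = max(best, length)
        # No earlier occurrence: the factor is the single new character.
        step = max(best, 1)
        factors.append(w[i:i + step])
        i += step
    return factors


factors = lz_factorize("abababab")
```

An online algorithm maintains the same factor list incrementally as characters arrive, instead of rescanning the prefix for every factor as this sketch does.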
Full-fledged Real-Time Indexing for Constant Size Alphabets
In this paper we describe a data structure that supports pattern matching
queries on a dynamically arriving text over an alphabet of constant size. Each
new symbol can be prepended to in O(1) worst-case time. At any moment, we
can report all occurrences of a pattern in the current text in
time, where is the length of and is the number of occurrences.
This resolves, under the assumption of a constant-size alphabet, a long-standing open
problem of the existence of a real-time indexing method for string matching (see
\cite{AmirN08}).