Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its
applications to a number of rank/select problems for integer keys and strings.
Given a string of length n over an alphabet of size $\sigma$, our
method builds the wavelet tree in $O(n \log \sigma / \sqrt{\log n})$ time,
improving upon the state-of-the-art algorithm by a factor of $\sqrt{\log n}$.
As a consequence, given an array of n integers we can construct in $O(n \sqrt{\log n})$ time a data structure consisting of $O(n)$ machine words and
capable of answering rank/select queries for the subranges of the array.
This improves upon the query time of Chan and Pătrașcu and upon the
construction time of Brodal et al.
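Since the abstract centres on rank/select over integer arrays, a minimal pointer-based wavelet tree may make the mechanism concrete. This is a sketch of the classic structure only; the class name and layout are illustrative, and the construction is the textbook one, not the paper's improved algorithm:

```python
class WaveletTree:
    """Toy wavelet tree over an integer array: each node halves the
    alphabet range and stores a bitvector recording which half each
    element falls into; prefix sums give O(1) binary rank per node."""

    def __init__(self, data, lo=None, hi=None):
        if lo is None:
            lo, hi = min(data), max(data)
        self.lo, self.hi = lo, hi
        if lo == hi or not data:
            self.bits = None          # leaf (or empty) node
            return
        mid = (lo + hi) // 2
        # bit = 1 means the element goes to the right child (value > mid)
        self.bits = [1 if x > mid else 0 for x in data]
        self.pref = [0]
        for b in self.bits:
            self.pref.append(self.pref[-1] + b)
        self.left = WaveletTree([x for x in data if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in data if x > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i elements."""
        if c < self.lo or c > self.hi or i <= 0:
            return 0
        if self.lo == self.hi:
            return i                  # all elements in a leaf equal c
        mid = (self.lo + self.hi) // 2
        ones = self.pref[i]           # elements routed right within prefix
        if c <= mid:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

wt = WaveletTree([3, 1, 4, 1, 5, 2, 1])
assert wt.rank(1, 4) == 2  # two 1s among the first four elements
```

Each query walks one root-to-leaf path, i.e. one rank per level of the value range, which is the behaviour the paper's construction speeds up at build time.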
Next, we switch to the stringological context and propose a novel notion of
wavelet suffix trees. For a string w of length n, this data structure occupies
$O(n)$ words, takes $O(n \sqrt{\log n})$ time to construct, and simultaneously
captures the combinatorial structure of substrings of w while enabling
efficient top-down traversal and binary search. In particular, with a wavelet
suffix tree we are able to answer in $O(\log |x|)$ time the following two
natural analogues of rank/select queries for suffixes of substrings: for
substrings x and y of w, count the number of suffixes of x that are
lexicographically smaller than y; and for a substring x of w and an integer k,
find the k-th lexicographically smallest suffix of x.
We further show that wavelet suffix trees allow computing a
run-length-encoded Burrows-Wheeler transform of a substring x of w in $O(s \log |x|)$ time, where s denotes the length of the resulting run-length encoding.
This answers a question by Cormode and Muthukrishnan, who considered an
analogous problem for Lempel-Ziv compression.
Comment: 33 pages, 5 figures; preliminary version published at SODA 2015
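For readers unfamiliar with the run-length-encoded Burrows-Wheeler transform mentioned above, here is a naive sketch that sorts all rotations of the input; the `"\0"` sentinel is an assumption of this sketch, and the quadratic rotation sort is nothing like the paper's output-sensitive substring algorithm:

```python
def rle_bwt(x):
    """Run-length-encoded Burrows-Wheeler transform of x (naive version):
    sort all rotations of x plus a sentinel, take the last column, then
    collapse equal adjacent symbols into (symbol, run_length) pairs."""
    s = x + "\0"                      # sentinel smaller than all symbols
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    runs = []
    for ch in bwt:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return [(c, k) for c, k in runs]
```

On a repetitive input the run list is much shorter than the string itself, which is exactly why an algorithm whose cost depends only on the number of runs s is attractive.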
RLZAP: Relative Lempel-Ziv with Adaptive Pointers
Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of
genomes from individuals of the same species when fast random access is
desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a
reference genome is selected and then the other genomes are greedily parsed
into phrases exactly matching substrings of the reference. Deorowicz and
Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with
a mismatch character usually gives better compression because many of the
differences between individuals' genomes are single-nucleotide substitutions.
Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers
and run-length compressing them usually gives even better compression. In this
paper we generalize Ferrada et al.'s idea to also handle short insertions,
deletions, and multi-character substitutions. We show experimentally that our
generalization achieves better compression than Ferrada et al.'s implementation
with comparable random-access times.
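The greedy parsing with trailing mismatch characters described above can be sketched as follows; the function names and the quadratic scan of the reference are illustrative (real implementations index the reference with a suffix array or FM-index), and relative/adaptive pointers, RLZAP's actual contribution, are not modelled:

```python
def rlz_parse(reference, target):
    """Greedy RLZ parse of target against reference. Each phrase is
    (ref_pos, length, mismatch_char): the longest prefix of the remaining
    target occurring in the reference, plus one literal character, as in
    Deorowicz and Grabowski's variant."""
    phrases, i, n = [], 0, len(target)
    while i < n:
        best_pos, best_len = -1, 0
        for j in range(len(reference)):
            k = 0
            # cap at n - 1 so a mismatch character always remains
            while (j + k < len(reference) and i + k < n - 1
                   and reference[j + k] == target[i + k]):
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases

def rlz_decode(reference, phrases):
    """Invert the parse by copying from the reference and appending literals."""
    out = []
    for pos, length, ch in phrases:
        out.append(reference[pos:pos + length])
        out.append(ch)
    return "".join(out)
```

A single-nucleotide substitution costs one phrase boundary here, which is the observation motivating the mismatch-character variant in the first place.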
A simple online competitive adaptation of Lempel-Ziv compression with efficient random access support
We present a simple adaptation of the Lempel-Ziv 1978 (LZ78) compression
scheme ({\em IEEE Transactions on Information Theory, 1978}) that supports
efficient random access to the input string. Namely, given query access to the
compressed string, it is possible to efficiently recover any symbol of the
input string. The compression algorithm is given as input a parameter $\epsilon
> 0$, and with very high probability increases the length of the compressed
string by at most a factor of $(1+\epsilon)$. The access time is $O(\log n +
1/\epsilon^2)$ in expectation, and $O(\log n/\epsilon^2)$ with high probability. The
scheme relies on sparse transitive-closure spanners. Any (consecutive)
substring of the input string can be retrieved at an additional additive cost
in the running time of the length of the substring. We also formally establish
the necessity of modifying LZ78 so as to allow efficient random access.
Specifically, we construct a family of strings for which
queries to the LZ78-compressed string are required in order to recover a single
symbol in the input string. The main benefit of the proposed scheme is that it
preserves the online nature and simplicity of LZ78, and that for {\em every}
input string, the length of the compressed string is only a small factor larger
than that obtained by running LZ78.
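The baseline being adapted is textbook LZ78, which parses the input into phrases of the form (dictionary index, next character); a minimal sketch is below. The paper's augmentation (spanner-based shortcuts enabling random access) is not implemented here:

```python
def lz78_compress(s):
    """Classic LZ78: emit (dictionary_index, next_char) pairs, where
    index 0 denotes the empty phrase. Decoding a symbol of the original
    string may require unwinding a long chain of indices, which is the
    random-access problem the paper addresses."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                        # flush a trailing complete phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decompress(pairs):
    """Rebuild the phrase list in order and concatenate."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Note that the dictionary is prefix-closed, which is what makes the trailing flush (emitting the phrase via its longest proper prefix) valid.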
Complexity of Networks
Network or graph structures are ubiquitous in the study of complex systems.
Often, we are interested in complexity trends of these systems as they evolve
under some dynamic. An example might be looking at the complexity of a food web
as species enter an ecosystem via migration or speciation, and leave via
extinction.
In this paper, a complexity measure of networks is proposed based on the {\em
complexity is information content} paradigm. To apply this paradigm to any
object, one must fix two things: a representation language, in which strings of
symbols from some alphabet describe, or stand for the objects being considered;
and a means of determining when two such descriptions refer to the same object.
With these two things set, the information content of an object can be computed
in principle from the number of equivalent descriptions describing a particular
object.
I propose a simple representation language for undirected graphs that can be
encoded as a bitstring, with equivalence taken to be topological equivalence. I
also present an algorithm for computing the complexity of an arbitrary
undirected network.
Comment: Accepted for Australian Conference on Artificial Life (ACAL05). To
appear in Advances in Natural Computation (World Scientific).
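A minimal reading of the "complexity is information content" paradigm for graphs can be sketched as follows, under two assumptions of this sketch: the representation language is the upper-triangle adjacency bitstring, and the information content is the description length minus the bits saved by counting topologically equivalent descriptions. The brute-force enumeration over all vertex orderings is illustrative only and is not the paper's algorithm:

```python
from itertools import permutations
from math import log2

def encode(adj, order):
    """Upper-triangle adjacency bitstring of a graph under a vertex order."""
    n = len(order)
    return "".join(
        "1" if frozenset((order[i], order[j])) in adj else "0"
        for i in range(n) for j in range(i + 1, n))

def complexity(n, edges):
    """Information content of an unlabelled n-vertex graph: total
    description length minus log2 of the number of distinct bitstrings
    that describe the same (topologically equivalent) graph. Feasible
    only for tiny n, since all n! orderings are enumerated."""
    adj = {frozenset(e) for e in edges}
    equivalent = {encode(adj, p) for p in permutations(range(n))}
    total_bits = n * (n - 1) // 2
    return total_bits - log2(len(equivalent))

# A triangle is maximally symmetric: every labelling yields "111",
# so the equivalence class is a single description.
print(complexity(3, [(0, 1), (1, 2), (0, 2)]))  # 3.0
```

Highly symmetric graphs have few distinct descriptions and so save no bits, while asymmetric graphs spread over many equivalent descriptions and measure as simpler.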
c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches
Given a dynamic set $\mathcal{D}$ of $d$ strings of total length $n$ whose characters
are drawn from an alphabet of size $\sigma$, a keyword dictionary is a data
structure built on $\mathcal{D}$ that provides locate, prefix search, and update
operations on $\mathcal{D}$. Under the assumption that $\alpha = w / \lg \sigma$
characters fit into a single machine word of $w$ bits, we propose a keyword dictionary
that represents $\mathcal{D}$ in $n \lg \sigma + \Theta(d \lg n)$ bits of space,
supporting all operations in $O(m/\alpha + \lg \alpha)$ expected time on an
input string of length $m$ in the word RAM model. This data structure is
underpinned by an exhaustive practical evaluation, highlighting the practical
usefulness of the proposed data structure, especially for prefix searches, one
of the most elementary keyword dictionary operations.
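The three keyword-dictionary operations named above can be demonstrated with a plain dict-of-dicts trie; this sketch shows only the interface, and none of c-trie++'s actual machinery (packing characters into machine words, compacting unary paths) is attempted:

```python
class Trie:
    """Pointer-based trie supporting the keyword-dictionary operations
    from the abstract: insert (update), locate, and prefix search."""

    _END = object()  # sentinel key marking the end of a stored keyword

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = True

    def locate(self, word):
        """Is `word` itself a member of the dictionary?"""
        node = self._walk(word)
        return node is not None and Trie._END in node

    def starts_with(self, prefix):
        """Does any stored keyword begin with `prefix`?"""
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            if ch not in node:
                return None
            node = node[ch]
        return node
```

Both queries cost one dictionary lookup per input character, i.e. O(m) for a query of length m; the paper's point is to beat that by processing α packed characters per word operation.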
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a
string attractor for a text of length $n$. Our index takes $O(\gamma \log (n/\gamma))$
words of space and supports locating the $occ$
occurrences of any pattern of length $m$ in $O(m \log n + occ \log^{\epsilon} n)$
time, for any constant $\epsilon > 0$. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
Comment: Fixed with reviewer's comment
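The string-attractor property that underlies the index, namely that every distinct substring has an occurrence crossing an attractor position, can be verified by brute force. This checker is for building intuition on tiny inputs (it enumerates all O(n^2) substrings) and has nothing to do with the index's query machinery:

```python
def is_string_attractor(text, gamma):
    """Check whether the position set gamma is a string attractor of text:
    every distinct substring must have at least one occurrence whose
    position range intersects gamma."""
    n = len(text)
    gamma = set(gamma)
    for length in range(1, n + 1):
        seen = set()
        for i in range(n - length + 1):
            sub = text[i:i + length]
            if sub in seen:
                continue
            seen.add(sub)
            # does some occurrence of sub cross an attractor position?
            if not any(gamma & set(range(j, j + length))
                       for j in range(n - length + 1)
                       if text[j:j + length] == sub):
                return False
    return True
```

Any dictionary compressor's output yields such a set (e.g. phrase boundaries), which is why an index built on attractors alone is universal across those compressors.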