2,929 research outputs found
An implementation of dynamic fully compressed suffix trees
Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia InformáticaThis dissertation studies and implements a dynamic fully compressed suffix tree.
Suffix trees are important algorithms in stringology and provide optimal solutions
for myriads of problems. Suffix trees are used, in bioinformatics to index large
volumes of data. For most aplications suffix trees need to be efficient in size and
funcionality. Until recently they were very large, suffix trees for the 700 megabyte
human genome spawn 40 gigabytes of data.
The compressed suffix tree requires less space and the recent static fully compressed
suffix tree requires even less space, in fact it requires optimal compressed space.
However since it is static it is not suitable for dynamic environments. Chan et.
al.[3] proposed the first dynamic compressed suffix tree however the space used for
a text of size n is O(n log )bits which is far from the new static solutions. Our
goal is to implement a recent proposal by Russo, Arlindo and Navarro[22] that defines a dynamic fully compressed suffix tree and uses only nH0 +O(n log ) bits of space
Dynamic Data Structures for Document Collections and Graphs
In the dynamic indexing problem, we must maintain a changing collection of
text documents so that we can efficiently support insertions, deletions, and
pattern matching queries. We are especially interested in developing efficient
data structures that store and query the documents in compressed form. All
previous compressed solutions to this problem rely on answering rank and select
queries on a dynamic sequence of symbols. Because of the lower bound in
[Fredman and Saks, 1989], answering rank queries presents a bottleneck in
compressed dynamic indexing. In this paper we show how this lower bound can be
circumvented using our new framework. We demonstrate that the gap between
static and dynamic variants of the indexing problem can be almost closed. Our
method is based on a novel framework for adding dynamism to static compressed
data structures. Our framework also applies more generally to dynamizing other
problems. We show, for example, how our framework can be applied to develop
compressed representations of dynamic graphs and binary relations
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small.Comment: SPIRE 201
New Algorithms for Position Heaps
We present several results about position heaps, a relatively new alternative
to suffix trees and suffix arrays. First, we show that, if we limit the maximum
length of patterns to be sought, then we can also limit the height of the heap
and reduce the worst-case cost of insertions and deletions. Second, we show how
to build a position heap in linear time independent of the size of the
alphabet. Third, we show how to augment a position heap such that it supports
access to the corresponding suffix array, and vice versa. Fourth, we introduce
a variant of a position heap that can be simulated efficiently by a compressed
suffix array with a linear number of extra bits
Fully dynamic data structure for LCE queries in compressed space
A Longest Common Extension (LCE) query on a text of length asks for
the length of the longest common prefix of suffixes starting at given two
positions. We show that the signature encoding of size [Mehlhorn et al., Algorithmica 17(2):183-198,
1997] of , which can be seen as a compressed representation of , has a
capability to support LCE queries in time,
where is the answer to the query, is the size of the Lempel-Ziv77
(LZ77) factorization of , and is an integer that can be handled
in constant time under word RAM model. In compressed space, this is the fastest
deterministic LCE data structure in many cases. Moreover, can be
enhanced to support efficient update operations: After processing
in time, we can insert/delete any (sub)string of length
into/from an arbitrary position of in time, where . This yields
the first fully dynamic LCE data structure. We also present efficient
construction algorithms from various types of inputs: We can construct
in time from uncompressed string ; in
time from grammar-compressed string
represented by a straight-line program of size ; and in time from LZ77-compressed string with factors. On top
of the above contributions, we show several applications of our data structures
which improve previous best known results on grammar-compressed string
processing.Comment: arXiv admin note: text overlap with arXiv:1504.0695
Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string and a source string , a relative
compression of with respect to is an encoding of as a sequence of
references to substrings of . Relative compression schemes are a classic
model of compression and have recently proved very successful for compressing
highly-repetitive massive data sets such as genomes and web-data. We initiate
the study of relative compression in a dynamic setting where the compressed
source string is subject to edit operations. The goal is to maintain the
compressed representation compactly, while supporting edits and allowing
efficient random access to the (uncompressed) source string. We present new
data structures that achieve optimal time for updates and queries while using
space linear in the size of the optimal relative compression, for nearly all
combinations of parameters. We also present solutions for restricted and
extended sets of updates. To achieve these results, we revisit the dynamic
partial sums problem and the substring concatenation problem. We present new
optimal or near optimal bounds for these problems. Plugging in our new results
we also immediately obtain new bounds for the string indexing for patterns with
wildcards problem and the dynamic text and static pattern matching problem
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence
- …