Succinct Representations of Dynamic Strings
The rank and select operations over a string of length n from an alphabet of
size $\sigma$ have been used widely in the design of succinct data structures.
In many applications, the string itself needs to be maintained dynamically,
allowing characters of the string to be inserted and deleted. Under the word
RAM model with word size , we design a succinct representation
of dynamic strings using bits to support rank,
select, insert and delete in time. When the alphabet size is small, i.e. when
$\sigma = O(\mathrm{polylog}(n))$, including the case in which the string is a bit vector, these operations
are supported in time. Our data structures are more
efficient than previous results on the same problem, and we have applied them
to improve results on the design and construction of space-efficient text
indexes.
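To make the four operations named in this abstract concrete, here is a minimal reference sketch of their semantics. It is not the succinct structure the abstract describes: every operation below takes linear time and the string is stored uncompressed. The class name `NaiveDynamicString`, the 0-based positions, and the convention that `rank(c, i)` counts within the first `i` characters are all assumptions made for illustration.

```python
class NaiveDynamicString:
    """Reference semantics only: every operation here takes O(n) time."""

    def __init__(self, text=""):
        self.chars = list(text)

    def rank(self, c, i):
        """Number of occurrences of c among the first i characters."""
        return self.chars[:i].count(c)

    def select(self, c, k):
        """0-based position of the k-th occurrence of c (k >= 1), or -1."""
        seen = 0
        for pos, ch in enumerate(self.chars):
            if ch == c:
                seen += 1
                if seen == k:
                    return pos
        return -1

    def insert(self, i, c):
        """Insert character c so that it ends up at position i."""
        self.chars.insert(i, c)

    def delete(self, i):
        """Remove the character at position i."""
        del self.chars[i]


s = NaiveDynamicString("abracadabra")
print(s.rank("a", 4), s.select("a", 3))
```

A baseline like this is mainly useful as a test oracle when checking a faster implementation against random sequences of updates and queries.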
Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string and a source string, a relative
compression of the source with respect to the reference is an encoding of the
source as a sequence of references to substrings of the reference string.
Relative compression schemes are a classic
model of compression and have recently proved very successful for compressing
highly-repetitive massive data sets such as genomes and web-data. We initiate
the study of relative compression in a dynamic setting where the compressed
source string is subject to edit operations. The goal is to maintain the
compressed representation compactly, while supporting edits and allowing
efficient random access to the (uncompressed) source string. We present new
data structures that achieve optimal time for updates and queries while using
space linear in the size of the optimal relative compression, for nearly all
combinations of parameters. We also present solutions for restricted and
extended sets of updates. To achieve these results, we revisit the dynamic
partial sums problem and the substring concatenation problem. We present new
optimal or near optimal bounds for these problems. Plugging in our new results
we also immediately obtain new bounds for the string indexing for patterns with
wildcards problem and the dynamic text and static pattern matching problem.
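The static version of the encoding described in this abstract can be sketched as follows. This is a quadratic-time greedy factorisation plus random access through prefix sums of phrase lengths, not the dynamic data structures the abstract presents. The function names `relative_compress`, `phrase_ends`, and `access`, and the `(start, length)` phrase format, are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def relative_compress(reference, source):
    """Greedily factor source into (start, length) substrings of reference."""
    phrases = []
    i = 0
    while i < len(source):
        best_start, best_len = -1, 0
        # Try every start in the reference and keep the longest match.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(source)
                   and reference[start + length] == source[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            raise ValueError(f"{source[i]!r} does not occur in the reference")
        phrases.append((best_start, best_len))
        i += best_len
    return phrases


def phrase_ends(phrases):
    """Prefix sums of phrase lengths: ends[p] is where phrase p stops in the source."""
    return list(accumulate(length for _, length in phrases))


def access(reference, phrases, ends, i):
    """Return source[i] without decompressing, via binary search on the prefix sums."""
    p = bisect_right(ends, i)
    offset = i - (ends[p - 1] if p else 0)
    start, _ = phrases[p]
    return reference[start + offset]


reference = "ACGTACGGT"
source = "ACGGTACGT"
phrases = relative_compress(reference, source)
ends = phrase_ends(phrases)
decoded = "".join(access(reference, phrases, ends, i) for i in range(len(source)))
```

Because `ends` is a static array here, any edit to the source would force it to be rebuilt. That is where a dynamic partial sums structure would plausibly come in, since it maintains the same prefix sums under updates.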
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes by up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying).
Comment: 14 pages in Transactions of the Association for Computational
Linguistics (TACL) 201
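As a rough illustration of computing n-gram statistics on the fly from a suffix-based index, the sketch below counts token sequences with a plain, uncompressed suffix array and derives an unsmoothed relative-frequency estimate. It swaps in a simpler structure than the compressed suffix tree the abstract describes, and it applies no smoothing. The function names and the toy corpus are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix itself."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count(tokens, sa, pattern):
    """Number of occurrences of pattern (a list of tokens) in the corpus."""
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them sorted,
    # so all matches form one contiguous range.
    prefixes = [tokens[i:i + m] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


def mle_probability(tokens, sa, context, word):
    """Unsmoothed estimate count(context + word) / count(context)."""
    denom = count(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count(tokens, sa, context + [word]) / denom


tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
p = mle_probability(tokens, sa, ["the"], "cat")
```

Since the context can be any length, the same two functions serve every Markov order without storing a separate table per order. A real index would avoid materialising `prefixes` on each query.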
More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries
We consider the problem of representing, in a compressed format, a bit-vector
of bits with 1s, supporting the following operations, where : returns the number of occurrences of bit in the
prefix ; returns the position of the th occurrence
of bit in . Such a data structure is called a fully indexable
dictionary (FID) [Raman et al., 2007], and is at least as powerful as
predecessor data structures. Our focus is on space-efficient FIDs on the
RAM model with word size and constant time for all
operations, so that the time cost is independent of the input size. Given the
bitstring to be encoded, having length and containing ones, the
minimal amount of information that needs to be stored is . The state of the art in building a FID for is
given in [Patrascu, 2008] using
bits, to support the operations in time. Here, we propose a parametric
data structure exhibiting a time/space trade-off such that, for any real
constants , it
uses $B(n,m) + O(n^{1+\delta} + n (\frac{m}{n^s})^{\epsilon})$ bits and performs
all the operations in time $O(s\delta^{-1} + \epsilon^{-1})$. The improvement is
twofold: our redundancy can be lowered parametrically and, fixing ,
we get a constant-time FID whose space is $B(n,m) + O(m^{\epsilon}/\mathrm{poly}(n))$ bits,
for sufficiently large . This is a significant improvement compared to the
previous bounds for the general case.
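The rank and select queries on a bit vector can be illustrated with a small static sketch that stores one cumulative count per block. It is neither compressed nor constant-time, so it does not reproduce the trade-off claimed in this abstract: rank scans at most one block, and select is a binary search over rank. The class name `SampledBitVector`, the `block` parameter, and the 0-based conventions are assumptions made for illustration.

```python
class SampledBitVector:
    """Static bit vector with one cumulative 1-count stored per block."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # ones_before[j] = number of 1s in bits[0 : j * block]
        self.ones_before = [0]
        for j in range(0, len(self.bits), block):
            self.ones_before.append(self.ones_before[-1] + sum(self.bits[j:j + block]))

    def rank(self, b, i):
        """Occurrences of bit b in bits[0:i]."""
        j = i // self.block
        ones = self.ones_before[j] + sum(self.bits[j * self.block:i])
        return ones if b == 1 else i - ones

    def select(self, b, k):
        """0-based position of the k-th occurrence of b (k >= 1), or -1."""
        n = len(self.bits)
        if k < 1 or self.rank(b, n) < k:
            return -1
        lo, hi = 0, n - 1
        # Find the smallest position p with rank(b, p + 1) >= k.
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(b, mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo


bv = SampledBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
```

A larger `block` shrinks the table of stored counts but lengthens the scan inside each rank query, a coarse version of the time/space tension the abstract parametrises.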