Succinct Partial Sums and Fenwick Trees
We consider the well-studied partial sums problem in succinct space, where one
is to maintain an array of n k-bit integers subject to updates such that
partial sums queries can be efficiently answered. We present two succinct
versions of the Fenwick Tree, which is known for its simplicity and
practicality. Our results hold in the encoding model where one is allowed to
reuse the space from the input data. Our main result is the first that only
requires nk + o(n) bits of space while still supporting sum/update in O(log_b
n) / O(b log_b n) time where 2 <= b <= log^O(1) n. The second result shows how
optimal time for sum/update can be achieved while only slightly increasing the
space usage to nk + o(nk) bits. Beyond Fenwick Trees, the results are primarily
based on bit-packing and sampling, which makes them very practical, and they also
allow for simple optimal parallelization.
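For orientation, here is a minimal sketch of the classic Fenwick tree that the abstract builds on. It is the textbook binary version (b = 2) with no succinct encoding, so it does not reproduce the space bounds claimed above. The class and method names are illustrative choices, not taken from the paper.

```python
class FenwickTree:
    """Classic (non-succinct) Fenwick tree; the internal array is 1-indexed."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to the element at 0-based index i in O(log n) time."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # move to the next node whose range covers index i

    def sum(self, i):
        """Return the sum of the first i elements in O(log n) time."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total


ft = FenwickTree([3, 1, 4, 1, 5])
print(ft.sum(3))   # sum of 3, 1, 4
ft.update(1, 2)    # the array becomes [3, 3, 4, 1, 5]
print(ft.sum(5))
```

Each node stores the sum of a range whose length is the lowest set bit of its index, which is why both operations touch only O(log n) nodes.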
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding Burrows-Wheeler Transformed strings, resulting in
Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in O(n lg r) time and O(r lg n) bits of space, where n
is the length of the input string S received so far and r is the number of runs
in the BWT of the reversed S. We improve the state-of-the-art algorithm for
online RLBWT in terms of empirical construction time. Adopting the dynamic list
for maintaining a total order, we can replace rank queries in a dynamic wavelet
tree on a run-length compressed string by the direct comparison of labels in a
dynamic list. The empirical results on various benchmarks show the efficiency
of our algorithm, especially for highly repetitive strings. Comment: In Proc. IWOCA201
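To make the object being constructed concrete, the following sketch computes a BWT by sorting rotations and then run-length encodes it. This is a naive static construction for illustration only: it is neither online nor run-compressed, and it is not the algorithm proposed in the abstract. The function names and the `$` sentinel are assumptions.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (quadratic space, illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse each maximal run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


transformed = bwt("banana")
print(transformed)                     # annb$aa
print(run_length_encode(transformed))  # 5 runs for 7 symbols
```

On highly repetitive inputs the number of runs tends to be far smaller than the string length, which is what makes working in run-compressed space attractive.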
Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string R and a source string S, a relative
compression of S with respect to R is an encoding of S as a sequence of
references to substrings of R. Relative compression schemes are a classic
model of compression and have recently proved very successful for compressing
highly-repetitive massive data sets such as genomes and web-data. We initiate
the study of relative compression in a dynamic setting where the compressed
source string is subject to edit operations. The goal is to maintain the
compressed representation compactly, while supporting edits and allowing
efficient random access to the (uncompressed) source string. We present new
data structures that achieve optimal time for updates and queries while using
space linear in the size of the optimal relative compression, for nearly all
combinations of parameters. We also present solutions for restricted and
extended sets of updates. To achieve these results, we revisit the dynamic
partial sums problem and the substring concatenation problem. We present new
optimal or near optimal bounds for these problems. Plugging in our new results
we also immediately obtain new bounds for the string indexing for patterns with
wildcards problem and the dynamic text and static pattern matching problem.
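As a rough illustration of the static version of the problem, the sketch below greedily encodes a source string as (start, length) references into a reference string and supports random access by walking the factor lengths. It is a naive static encoder, not the dynamic data structure from the abstract. All function names are hypothetical, and the sketch assumes every character of the source occurs in the reference.

```python
def relative_compress(reference, source):
    """Greedily encode source as (start, length) references into reference."""
    factors = []
    i = 0
    while i < len(source):
        start = reference.find(source[i])
        if start < 0:
            raise ValueError(f"character {source[i]!r} does not occur in reference")
        length = 1
        # Extend the match while a longer substring still occurs in reference.
        while i + length < len(source):
            candidate = reference.find(source[i:i + length + 1])
            if candidate < 0:
                break
            start, length = candidate, length + 1
        factors.append((start, length))
        i += length
    return factors


def decompress(reference, factors):
    """Rebuild the source string from its references."""
    return "".join(reference[start:start + length] for start, length in factors)


def random_access(reference, factors, i):
    """Return source[i] by scanning factor lengths (linear time in this sketch)."""
    for start, length in factors:
        if i < length:
            return reference[start + i]
        i -= length
    raise IndexError("position is past the end of the source")


R = "abracadabra"
S = "cadabraabra"
factors = relative_compress(R, S)
print(factors)                       # [(4, 7), (0, 4)]
print(decompress(R, factors) == S)
print(random_access(R, factors, 8))  # b
```

Locating the factor that covers position i is a search over the prefix sums of the factor lengths, so replacing the linear scan in `random_access` with a dynamic partial sums structure is the natural way to speed up access when edits change those lengths.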
A framework of dynamic data structures for string processing
In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.
Compressed Data Structures for Dynamic Sequences
We consider the problem of storing a dynamic string S over an alphabet
Σ = {1, ..., σ} in compressed form. Our representation
supports insertions and deletions of symbols and answers three fundamental
queries: access(i, S) returns the i-th symbol in S, rank_a(i, S)
counts how many times a symbol a occurs among the
first i positions in S, and select_a(i, S) finds the position
where a symbol a occurs for the i-th time. We present the first
fully-dynamic data structure for arbitrarily large alphabets that achieves
optimal query times for all three operations and supports updates with
worst-case time guarantees. Ours is also the first fully-dynamic data structure
that needs only nH_k + o(n log σ) bits, where H_k is the k-th order
entropy and n is the string length. Moreover, our representation supports
extraction of a substring in optimal time.
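The three queries can be pinned down with a naive reference implementation over an uncompressed Python string. This sketch only fixes the semantics (1-based positions, as in the abstract); it takes linear time per query and uses no compression, so it says nothing about the bounds claimed above. The argument order is an arbitrary choice.

```python
def access(s, i):
    """Return the i-th symbol of s (1-based)."""
    return s[i - 1]


def rank(s, a, i):
    """Count how many times symbol a occurs among the first i positions of s."""
    return s[:i].count(a)


def select(s, a, i):
    """Return the 1-based position of the i-th occurrence of a, or None."""
    seen = 0
    for position, symbol in enumerate(s, start=1):
        if symbol == a:
            seen += 1
            if seen == i:
                return position
    return None


s = "abracadabra"
print(access(s, 5))       # c
print(rank(s, "a", 8))    # 4
print(select(s, "a", 3))  # 6
```

Note that rank and select are inverses in the sense that `rank(s, a, select(s, a, i)) == i` whenever the i-th occurrence exists.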
R3D3: A doubly opportunistic data structure for compressing and indexing massive data
Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in information-theoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures to not only attain best possible compression on the input data but also on the index. We present R3D3 that encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries at arbitrary positions in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains several times space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while it supports operations on the compressed data at comparable speed.
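The nH0 target in the abstract can be made tangible by computing the empirical zero-order entropy of a small bitvector. The sketch below only evaluates the bound; it does not implement R3D3 or any compressed encoding, and the function name is an illustrative choice.

```python
import math


def zero_order_entropy(bits):
    """Empirical Shannon entropy H0, in bits per position, of a 0/1 sequence."""
    n = len(bits)
    ones = sum(bits)
    h0 = 0.0
    for count in (ones, n - ones):
        if count:  # a symbol that never occurs contributes nothing
            p = count / n
            h0 -= p * math.log2(p)
    return h0


sparse = [1] + [0] * 7
h0 = zero_order_entropy(sparse)
print(h0)                # roughly 0.54 bits per position
print(len(sparse) * h0)  # nH0, compared with 8 bits stored verbatim
```

The sparser (or denser) the bitvector, the further nH0 drops below n, which is the redundancy an opportunistic structure exploits.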
Approximate Query Processing over Static Sets and Sliding Windows
Indexing of static and dynamic sets is fundamental to a large set of applications such as information retrieval and caching. Denoting the characteristic vector of the set by B, we consider the problem of encoding sets and multisets to support approximate versions of the rank(i) (i.e., computing sum_{j <= i} B[j]) and select(i) (i.e., finding min{p | rank(p) >= i}) queries. We study multiple types of approximations (allowing an error in the query or the result) and present lower bounds and succinct data structures for several variants of the problem. We also extend our model to sliding windows, in which we process a stream of elements and compute suffix sums. This is a generalization of the window summation problem that allows the user to specify the window size at query time. Here, we provide an algorithm that supports updates and queries in constant time while requiring just (1+o(1)) factor more space than the fixed-window summation algorithms.
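One simple way to see what an error in the result can buy is to sample the prefix counts of the characteristic vector. The sketch below treats rank(i) as the number of ones in positions 0 through i, stores that count only at every `step`-th position, and answers with the nearest stored sample at or before the query, so the answer underestimates the true rank by at most step - 1. This is an illustration of the trade-off only, not one of the data structures or bounds from the abstract; the class and parameter names are hypothetical.

```python
class SampledRank:
    """Approximate rank over a 0/1 list using one stored count per block."""

    def __init__(self, bits, step):
        self.step = step
        self.samples = [0]  # samples[k] = number of ones in bits[:k * step]
        count = 0
        for position, bit in enumerate(bits, start=1):
            count += bit
            if position % step == 0:
                self.samples.append(count)

    def approx_rank(self, i):
        """Underestimate sum(bits[:i + 1]) by at most step - 1."""
        return self.samples[(i + 1) // self.step]


bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
index = SampledRank(bits, step=4)
print(index.approx_rank(9), sum(bits[:10]))  # 6 7
```

Larger values of `step` shrink the index proportionally while loosening the additive error bound.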