129 research outputs found
Succinct Indexable Dictionaries with Applications to Encoding -ary Trees, Prefix Sums and Multisets
We consider the {\it indexable dictionary} problem, which consists of storing
a set for some integer , while supporting the
operations of \Rank(x), which returns the number of elements in that are
less than if , and -1 otherwise; and \Select(i) which returns
the -th smallest element in . We give a data structure that supports both
operations in O(1) time on the RAM model and requires bits to store a set of size , where {\cal B}(n,m) = \ceil{\lg
{m \choose n}} is the minimum number of bits required to store any -element
subset from a universe of size . Previous dictionaries taking this space
only supported (yes/no) membership queries in O(1) time. In the cell probe
model we can remove the additive term in the space bound,
answering a question raised by Fich and Miltersen, and Pagh.
We present extensions and applications of our indexable dictionary data
structure, including:
An information-theoretically optimal representation of a -ary cardinal
tree that supports standard operations in constant time,
A representation of a multiset of size from in bits that supports (appropriate generalizations of) \Rank
and \Select operations in constant time, and
A representation of a sequence of non-negative integers summing up to
in bits that supports prefix sum queries in constant
time.Comment: Final version of SODA 2002 paper; supersedes Leicester Tech report
2002/1
More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries
We consider the problem of representing, in a compressed format, a bit-vector
of bits with 1s, supporting the following operations, where : returns the number of occurrences of bit in the
prefix ; returns the position of the th occurrence
of bit in . Such a data structure is called \emph{fully indexable
dictionary (FID)} [Raman et al.,2007], and is at least as powerful as
predecessor data structures. Our focus is on space-efficient FIDs on the
\textsc{ram} model with word size and constant time for all
operations, so that the time cost is independent of the input size. Given the
bitstring to be encoded, having length and containing ones, the
minimal amount of information that needs to be stored is . The state of the art in building a FID for is
given in [Patrascu,2008] using
bits, to support the operations in time. Here, we propose a parametric
data structure exhibiting a time/space trade-off such that, for any real
constants , it
uses B(n,m) + O(n^{1+\delta} + n (\frac{m}{n^s})^\eps) bits and performs
all the operations in time O(s\delta^{-1} + \eps^{-1}). The improvement is
twofold: our redundancy can be lowered parametrically and, fixing ,
we get a constant-time FID whose space is B(n,m) + O(m^\eps/\poly{n}) bits,
for sufficiently large . This is a significant improvement compared to the
previous bounds for the general case
Improved ESP-index: a practical self-index for highly repetitive texts
While several self-indexes for highly repetitive texts exist, developing a
practical self-index applicable to real world repetitive texts remains a
challenge. ESP-index is a grammar-based self-index on the notion of
edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees
upper bounds of parsing discrepancies between different appearances of the same
subtexts in a text. Although ESP-index performs efficient top-down searches of
query texts, it has a serious issue on binary searches for finding appearances
of variables for a query text, which resulted in slowing down the query
searches. We present an improved ESP-index (ESP-index-I) by leveraging the idea
behind succinct data structures for large alphabets. While ESP-index-I keeps
the same types of efficiencies as ESP-index about the top-down searches, it
avoid the binary searches using fast rank/select operations. We experimentally
test ESP-index-I on the ability to search query texts and extract subtexts from
real world repetitive texts on a large-scale, and we show that ESP-index-I
performs better that other possible approaches.Comment: This is the full version of a proceeding accepted to the 11th
International Symposium on Experimental Algorithms (SEA2014
Space-Efficient Graph Coarsening with Applications to Succinct Planar Encodings
We present a novel space-efficient graph coarsening technique for n-vertex planar graphs G, called cloud partition, which partitions the vertices V(G) into disjoint sets C of size O(log n) such that each C induces a connected subgraph of G. Using this partition ? we construct a so-called structure-maintaining minor F of G via specific contractions within the disjoint sets such that F has O(n/log n) vertices. The combination of (F, ?) is referred to as a cloud decomposition.
For planar graphs we show that a cloud decomposition can be constructed in O(n) time and using O(n) bits. Given a cloud decomposition (F, ?) constructed for a planar graph G we are able to find a balanced separator of G in O(n/log n) time. Contrary to related publications, we do not make use of an embedding of the planar input graph. We generalize our cloud decomposition from planar graphs to H-minor-free graphs for any fixed graph H. This allows us to construct the succinct encoding scheme for H-minor-free graphs due to Blelloch and Farzan (CPM 2010) in O(n) time and O(n) bits improving both runtime and space by a factor of ?(log n).
As an additional application of our cloud decomposition we show that, for H-minor-free graphs, a tree decomposition of width O(n^{1/2 + ?}) for any ? > 0 can be constructed in O(n) bits and a time linear in the size of the tree decomposition. A similar result by Izumi and Otachi (ICALP 2020) constructs a tree decomposition of width O(k ?n log n) for graphs of treewidth k ? ?n in sublinear space and polynomial time
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence
R3D3: A doubly opportunistic data structure for compressing and indexing massive data
Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in informationtheoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures to not only attain best possible compression on the input data but also on the index. We present R3D3 that encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries to arbitrary position in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains several times space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while it supports operations on the compressed data at comparable speed
- …