722 research outputs found

    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e., the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
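
To make the supported queries concrete, here is a minimal uncompressed reference for their semantics. It is not the Wavelet Trie: the class and method names (`NaiveIndexedSequence`, `access`, `rank`, `rank_prefix`, `select`, `append`) are hypothetical, positions are assumed to be 0-based, and every query scans a plain Python list.

```python
from typing import List


class NaiveIndexedSequence:
    """Plain-list reference for the query semantics; no compression at all."""

    def __init__(self, strings: List[str]) -> None:
        self.seq = list(strings)

    def access(self, pos: int) -> str:
        # Random access: the string stored at position pos (0-based).
        return self.seq[pos]

    def rank(self, s: str, pos: int) -> int:
        # Number of occurrences of s among the first pos strings.
        return sum(1 for t in self.seq[:pos] if t == s)

    def rank_prefix(self, p: str, pos: int) -> int:
        # Number of strings among the first pos that start with prefix p.
        return sum(1 for t in self.seq[:pos] if t.startswith(p))

    def select(self, s: str, k: int) -> int:
        # Position of the k-th occurrence of s (k counts from 0), or -1.
        for i, t in enumerate(self.seq):
            if t == s:
                if k == 0:
                    return i
                k -= 1
        return -1

    def append(self, s: str) -> None:
        # Dynamic setting: new strings, including unseen ones, may arrive.
        self.seq.append(s)
```

A compressed index would answer the same queries without scanning; this version is only useful as a correctness oracle on small inputs.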

    Rank, select and access in grammar-compressed strings

    Given a string $S$ of length $N$ on a fixed alphabet of $\sigma$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mbox{rank}_c(S,i)$ (return the number of occurrences of symbol $c$ before position $i$ in $S$); $\mbox{select}_c(S,i)$ (return the position of the $i$th occurrence of $c$ in $S$); and $\mbox{access}(S,i,j)$ (return substring $S[i,j]$). For rank and select we describe data structures of size $O(n\sigma\log N)$ bits that support the two operations in $O(\log N)$ time. We propose another structure that uses $O(n\sigma\log (N/n)(\log N)^{1+\epsilon})$ bits and that supports the two queries in $O(\log N/\log\log N)$ time, where $\epsilon>0$ is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires $O(n\log N)$ bits of space and $O(\log N+m/\log_\sigma N)$ time to extract $m=j-i+1$ consecutive symbols from $S$. Alternatively, we can achieve $O(\log N/\log\log N+m/\log_\sigma N)$ query time using $O(n\log (N/n)(\log N)^{1+\epsilon})$ bits of space. This matches a lower bound stated by Verbin and Yu for strings where $N$ is polynomially related to $n$. Comment: 16 pages
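
As a point of comparison for the bounds above, the following is a textbook baseline and not the data structure from the paper. It stores the expansion length of every nonterminal of a straight-line grammar and descends from the start symbol to extract one symbol. The grammar encoding (a dict from hypothetical rule names to either a terminal character or a pair of rule names) and the 0-based indexing are assumptions made for illustration. Access time here is proportional to the height of the grammar, and `rank` simply extracts symbols one by one.

```python
from typing import Dict, Tuple, Union

# A rule is either a single terminal character or a pair of nonterminal names.
Rule = Union[str, Tuple[str, str]]


def expansion_lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Length of the string generated by each nonterminal (memoised)."""
    lengths: Dict[str, int] = {}

    def length(name: str) -> int:
        if name not in lengths:
            rule = rules[name]
            if isinstance(rule, str):
                lengths[name] = 1
            else:
                lengths[name] = length(rule[0]) + length(rule[1])
        return lengths[name]

    for name in rules:
        length(name)
    return lengths


def access(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, i: int) -> str:
    """Return S[i] (0-based) by walking from the start symbol down to a terminal."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    name = start
    while True:
        rule = rules[name]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < lengths[left]:
            name = left           # position falls in the left child
        else:
            i -= lengths[left]    # skip the left child's expansion
            name = right


def rank(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, c: str, i: int) -> int:
    """Occurrences of c in S[0:i], by naive symbol-by-symbol extraction."""
    return sum(1 for k in range(i) if access(rules, lengths, start, k) == c)
```

For example, the rules `A -> a`, `B -> b`, `C -> A B`, `D -> C A`, `E -> D C` generate `abaab` from `E`, and `access` recovers each symbol without expanding the whole string.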

    Dynamic Integer Sets with Optimal Rank, Select, and Predecessor Search

    We present a data structure representing a dynamic set S of w-bit integers on a w-bit word RAM. With |S|=n and w > log n and space O(n), we support the following standard operations in O(log n / log w) time: insert(x) sets S = S + {x}; delete(x) sets S = S - {x}; predecessor(x) returns max{y in S | y <= x}; rank(x) returns #{y in S | y < x}; select(i) returns y in S with rank(y)=i, if any. Our O(log n / log w) bound is optimal for dynamic rank and select, matching a lower bound of Fredman and Saks [STOC'89]. When the word length is large, our time bound is also optimal for dynamic predecessor, matching a static lower bound of Beame and Fich [STOC'99] whenever log n / log w = O(log w / loglog w). Technically, the most interesting aspect of our data structure is that it supports all the above operations in constant time for sets of size n=w^{O(1)}. This resolves a main open problem of Ajtai, Komlos, and Fredman [FOCS'83]. Ajtai et al. presented such a data structure in Yao's abstract cell-probe model with w-bit cells/words, but pointed out that the functions used could not be implemented. As a partial solution to the problem, Fredman and Willard [STOC'90] introduced a fusion node that could handle queries in constant time, but used polynomial time on the updates. We call our small set data structure a dynamic fusion node as it does both queries and updates in constant time. Comment: Presented with different formatting in Proceedings of the 55th IEEE Symposium on Foundations of Computer Science (FOCS), 2014, pp. 166--175. The new version fixes a bug in one of the bounds stated for predecessor search, pointed out to me by Djamal Belazzougui.
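
The five operations can be pinned down with a small reference implementation over a sorted Python list. This is only a sketch of the interface, not the dynamic fusion node: updates cost O(n) here, the class name `SortedListSet` is hypothetical, `select` is assumed to be 0-based so that it agrees with `rank`, and `predecessor` is assumed to use a non-strict comparison (swap `bisect_right` for `bisect_left` to get the strict variant).

```python
import bisect
from typing import List, Optional


class SortedListSet:
    """Reference semantics only: O(log n) queries, O(n) updates."""

    def __init__(self) -> None:
        self.items: List[int] = []

    def insert(self, x: int) -> None:
        i = bisect.bisect_left(self.items, x)
        if i == len(self.items) or self.items[i] != x:
            self.items.insert(i, x)  # keep the list sorted and duplicate-free

    def delete(self, x: int) -> None:
        i = bisect.bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            self.items.pop(i)

    def rank(self, x: int) -> int:
        # Number of stored elements strictly smaller than x.
        return bisect.bisect_left(self.items, x)

    def select(self, i: int) -> Optional[int]:
        # The element y with rank(y) == i, if any.
        return self.items[i] if 0 <= i < len(self.items) else None

    def predecessor(self, x: int) -> Optional[int]:
        # Largest stored element not exceeding x (assumed non-strict).
        i = bisect.bisect_right(self.items, x)
        return self.items[i - 1] if i > 0 else None
```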

    Succinct Indexable Dictionaries with Applications to Encoding $k$-ary Trees, Prefix Sums and Multisets

    We consider the indexable dictionary problem, which consists of storing a set $S \subseteq \{0,...,m-1\}$ for some integer $m$, while supporting the operations of $\mathrm{Rank}(x)$, which returns the number of elements in $S$ that are less than $x$ if $x \in S$, and $-1$ otherwise; and $\mathrm{Select}(i)$, which returns the $i$-th smallest element in $S$. We give a data structure that supports both operations in $O(1)$ time on the RAM model and requires ${\cal B}(n,m) + o(n) + O(\lg \lg m)$ bits to store a set of size $n$, where ${\cal B}(n,m) = \lceil \lg {m \choose n} \rceil$ is the minimum number of bits required to store any $n$-element subset from a universe of size $m$. Previous dictionaries taking this space only supported (yes/no) membership queries in $O(1)$ time. In the cell probe model we can remove the $O(\lg \lg m)$ additive term in the space bound, answering a question raised by Fich and Miltersen, and Pagh. We present extensions and applications of our indexable dictionary data structure, including: an information-theoretically optimal representation of a $k$-ary cardinal tree that supports standard operations in constant time; a representation of a multiset of size $n$ from $\{0,...,m-1\}$ in ${\cal B}(n,m+n) + o(n)$ bits that supports (appropriate generalizations of) $\mathrm{Rank}$ and $\mathrm{Select}$ operations in constant time; and a representation of a sequence of $n$ non-negative integers summing up to $m$ in ${\cal B}(n,m+n) + o(n)$ bits that supports prefix sum queries in constant time. Comment: Final version of SODA 2002 paper; supersedes Leicester Tech report 2002/1
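
The space bound and the partial rank semantics are easy to check numerically. The sketch below computes the information-theoretic minimum, the ceiling of lg C(m, n), exactly with integer arithmetic and spells out the two queries over a sorted list. It is not the succinct structure from the paper: the function names are hypothetical, and `select` is assumed to count from 0 so that `select(rank(x))` returns `x` for members.

```python
import bisect
from math import comb
from typing import List


def info_lower_bound_bits(n: int, m: int) -> int:
    """Minimum bits to identify an n-element subset of {0, ..., m-1}."""
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    # For an integer x >= 1, ceil(lg x) equals (x - 1).bit_length().
    return (comb(m, n) - 1).bit_length()


def rank(sorted_set: List[int], x: int) -> int:
    """Number of elements smaller than x if x is a member, else -1."""
    i = bisect.bisect_left(sorted_set, x)
    if i < len(sorted_set) and sorted_set[i] == x:
        return i
    return -1


def select(sorted_set: List[int], i: int) -> int:
    """The i-th smallest element, counting from 0 (an assumption)."""
    return sorted_set[i]
```

For instance, a 2-element subset of a universe of size 4 has C(4, 2) = 6 possibilities, so at least 3 bits are needed.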

    Dynamic Elias-Fano Representation

    We show that it is possible to store a dynamic ordered set S of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound while preserving the asymptotic time optimality of the operations. Our results build on the Elias-Fano representation of monotone integer sequences, which can be shown to be less than half a bit per element away from the information-theoretic minimum. In particular, considering a RAM model with memory word size Theta(log u) bits, when integers are drawn from a polynomial universe of size u = n^gamma for any gamma = Theta(1), we add o(n) bits to the static Elias-Fano representation in order to: (1) support static predecessor/successor queries in O(min{1+log(u/n), loglog n}) time; (2) make S grow in an append-only fashion by spending O(1) time per inserted element; (3) describe a dynamic data structure supporting random access in O(log n / loglog n) worst-case time, insertions/deletions in O(log n / loglog n) amortized time, and predecessor/successor queries in O(min{1+log(u/n), loglog n}) worst-case time. These time bounds are optimal.
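
For readers unfamiliar with the static Elias-Fano representation that these results extend, here is a minimal sketch of the encoding and of random access. It covers only the static case and none of the dynamic machinery: the class name `EliasFano` is hypothetical, the bit vector is an ordinary Python list, and a precomputed list of one-positions stands in for a real constant-time select structure.

```python
from typing import List


class EliasFano:
    """Static Elias-Fano encoding of a non-decreasing integer sequence."""

    def __init__(self, values: List[int], universe: int) -> None:
        # values must be non-decreasing and lie in [0, universe).
        n = len(values)
        self.n = n
        # Low bits per element: floor(log2(universe / n)), never negative.
        self.low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
        mask = (1 << self.low_bits) - 1
        self.lows = [v & mask for v in values]
        # High parts in unary-bucket form: element i sets bit (high_i + i).
        self.high = [0] * (n + (universe >> self.low_bits) + 1)
        for i, v in enumerate(values):
            self.high[(v >> self.low_bits) + i] = 1
        # Stand-in for a select structure over the high bit vector.
        self.ones = [p for p, bit in enumerate(self.high) if bit]

    def access(self, i: int) -> int:
        # Position of the i-th one, minus i, recovers the high part.
        high = self.ones[i] - i
        return (high << self.low_bits) | self.lows[i]

    def size_in_bits(self) -> int:
        # Payload only: low bits plus the high bit vector, no select index.
        return self.n * self.low_bits + len(self.high)
```

Splitting each value into a fixed-width low part and a unary-coded high part is what keeps the total close to n times log(u/n) plus about two bits per element.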