2,906 research outputs found

    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    Full text link
    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence

    Rank, select and access in grammar-compressed strings

    Full text link
    Given a string SS of length NN on a fixed alphabet of σ\sigma symbols, a grammar compressor produces a context-free grammar GG of size nn that generates SS and only SS. In this paper we describe data structures to support the following operations on a grammar-compressed string: \mbox{rank}_c(S,i) (return the number of occurrences of symbol cc before position ii in SS); \mbox{select}_c(S,i) (return the position of the iith occurrence of cc in SS); and \mbox{access}(S,i,j) (return substring S[i,j]S[i,j]). For rank and select we describe data structures of size O(nσlogN)O(n\sigma\log N) bits that support the two operations in O(logN)O(\log N) time. We propose another structure that uses O(nσlog(N/n)(logN)1+ϵ)O(n\sigma\log (N/n)(\log N)^{1+\epsilon}) bits and that supports the two queries in O(logN/loglogN)O(\log N/\log\log N), where ϵ>0\epsilon>0 is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires O(nlogN)O(n\log N) bits of space and O(logN+m/logσN)O(\log N+m/\log_\sigma N) time to extract m=ji+1m=j-i+1 consecutive symbols from SS. Alternatively, we can achieve O(logN/loglogN+m/logσN)O(\log N/\log\log N+m/\log_\sigma N) query time using O(nlog(N/n)(logN)1+ϵ)O(n\log (N/n)(\log N)^{1+\epsilon}) bits of space. This matches a lower bound stated by Verbin and Yu for strings where NN is polynomially related to nn.Comment: 16 page

    CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

    Full text link
    In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with theoretical guarantees. Specifically, our method is based on FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants
    corecore