
    Succinct Representations of Dynamic Strings

    The rank and select operations over a string of length $n$ from an alphabet of size $\sigma$ have been used widely in the design of succinct data structures. In many applications, the string itself needs to be maintained dynamically, allowing characters of the string to be inserted and deleted. Under the word RAM model with word size $w=\Omega(\lg n)$, we design a succinct representation of dynamic strings using $nH_0 + o(n)\lg\sigma + O(w)$ bits to support rank, select, insert and delete in $O(\frac{\lg n}{\lg\lg n}(\frac{\lg \sigma}{\lg\lg n}+1))$ time. When the alphabet size is small, i.e. when $\sigma = O(\mathrm{polylog}(n))$, including the case in which the string is a bit vector, these operations are supported in $O(\frac{\lg n}{\lg\lg n})$ time. Our data structures are more efficient than previous results on the same problem, and we have applied them to improve results on the design and construction of space-efficient text indexes.
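    For concreteness, the semantics of rank and select can be sketched naively as below; the paper's contribution is answering these queries (plus insert and delete) within compressed space and $O(\frac{\lg n}{\lg\lg n})$-type time, whereas this sketch takes linear time and the function names are illustrative, not the paper's API.

```python
def rank(s: str, c: str, i: int) -> int:
    """Number of occurrences of character c in the prefix s[0:i]."""
    return s[:i].count(c)

def select(s: str, c: str, j: int) -> int:
    """0-based position of the j-th (1-based) occurrence of c in s, or -1."""
    count = 0
    for pos, ch in enumerate(s):
        if ch == c:
            count += 1
            if count == j:
                return pos
    return -1

s = "abracadabra"
print(rank(s, "a", 5))    # 'a' occurs twice in "abrac" -> 2
print(select(s, "a", 3))  # the third 'a' sits at index 5
```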

    Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

    Given a static reference string $R$ and a source string $S$, a relative compression of $S$ with respect to $R$ is an encoding of $S$ as a sequence of references to substrings of $R$. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string $S$ is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
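    The encoding itself can be illustrated with a naive greedy parse: repeatedly take the longest prefix of the remaining source that occurs in the reference. This is a $O(|S|\,|R|)$ sketch of the model only, not the paper's dynamic data structure, and it assumes every character of $S$ occurs in $R$.

```python
def relative_compress(R: str, S: str):
    """Greedy left-to-right parse of S into maximal substrings of R.
    Returns a list of (position in R, length) phrases.
    Assumes every character of S occurs somewhere in R."""
    phrases = []
    i = 0
    while i < len(S):
        # extend the current phrase while it still occurs in R
        j = i + 1
        while j <= len(S) and S[i:j] in R:
            j += 1
        j -= 1
        phrases.append((R.index(S[i:j]), j - i))
        i = j
    return phrases

def decompress(R: str, phrases) -> str:
    return "".join(R[p:p + l] for p, l in phrases)

R = "banana"
S = "nanaban"
enc = relative_compress(R, S)
print(enc)                       # [(2, 4), (0, 3)]
print(decompress(R, enc) == S)   # True
```

The dynamic problem studied in the paper is then to keep such a phrase sequence compact while $S$ undergoes insertions and deletions, with fast random access into the uncompressed $S$.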

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages; in Transactions of the Association for Computational Linguistics (TACL) 201

    More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries

    We consider the problem of representing, in a compressed format, a bit-vector $S$ of $m$ bits with $n$ 1s, supporting the following operations, where $b \in \{0, 1\}$: $rank_b(S,i)$ returns the number of occurrences of bit $b$ in the prefix $S[1..i]$; $select_b(S,i)$ returns the position of the $i$th occurrence of bit $b$ in $S$. Such a data structure is called a \emph{fully indexable dictionary (FID)} [Raman et al., 2007], and is at least as powerful as predecessor data structures. Our focus is on space-efficient FIDs on the \textsc{ram} model with word size $\Theta(\lg m)$ and constant time for all operations, so that the time cost is independent of the input size. Given the bitstring $S$ to be encoded, having length $m$ and containing $n$ ones, the minimal amount of information that needs to be stored is $B(n,m) = \lceil \log \binom{m}{n} \rceil$. The state of the art in building a FID for $S$ is given in [Patrascu, 2008], using $B(m,n) + O(m / ((\log m / t)^t)) + O(m^{3/4})$ bits to support the operations in $O(t)$ time. Here, we propose a parametric data structure exhibiting a time/space trade-off such that, for any real constants $0 < \delta \le 1/2$, $0 < \eps \le 1/2$, and integer $s > 0$, it uses $B(n,m) + O(n^{1+\delta} + n (\frac{m}{n^s})^\eps)$ bits and performs all the operations in time $O(s\delta^{-1} + \eps^{-1})$. The improvement is twofold: our redundancy can be lowered parametrically and, fixing $s = O(1)$, we get a constant-time FID whose space is $B(n,m) + O(m^\eps/\mathrm{poly}(n))$ bits, for sufficiently large $m$. This is a significant improvement compared to the previous bounds for the general case.
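    The information-theoretic baseline $B(n,m) = \lceil \log \binom{m}{n} \rceil$ that all these space bounds are measured against is easy to compute directly; the sketch below does so and shows how far it falls below the $m$ bits of the plain vector for sparse inputs (the function name `B` mirrors the abstract's notation).

```python
from math import ceil, comb, log2

def B(n: int, m: int) -> int:
    """Information-theoretic minimum, ceil(log2(binomial(m, n))), in bits."""
    if n == 0:
        return 0  # log2(1) = 0: an empty set needs no information
    return ceil(log2(comb(m, n)))

# A bit-vector of m = 1000 bits with n = 50 ones:
print(B(50, 1000))  # far fewer than the 1000 bits of the plain vector
```

The FID's total space is then this $B(n,m)$ plus the redundancy term, which is exactly what the paper parametrically lowers.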