2,929 research outputs found

    An implementation of dynamic fully compressed suffix trees

    Get PDF
    Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia InformáticaThis dissertation studies and implements a dynamic fully compressed suffix tree. Suffix trees are important algorithms in stringology and provide optimal solutions for myriads of problems. Suffix trees are used, in bioinformatics to index large volumes of data. For most aplications suffix trees need to be efficient in size and funcionality. Until recently they were very large, suffix trees for the 700 megabyte human genome spawn 40 gigabytes of data. The compressed suffix tree requires less space and the recent static fully compressed suffix tree requires even less space, in fact it requires optimal compressed space. However since it is static it is not suitable for dynamic environments. Chan et. al.[3] proposed the first dynamic compressed suffix tree however the space used for a text of size n is O(n log )bits which is far from the new static solutions. Our goal is to implement a recent proposal by Russo, Arlindo and Navarro[22] that defines a dynamic fully compressed suffix tree and uses only nH0 +O(n log ) bits of space

    Dynamic Data Structures for Document Collections and Graphs

    Full text link
    In the dynamic indexing problem, we must maintain a changing collection of text documents so that we can efficiently support insertions, deletions, and pattern matching queries. We are especially interested in developing efficient data structures that store and query the documents in compressed form. All previous compressed solutions to this problem rely on answering rank and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. In this paper we show how this lower bound can be circumvented using our new framework. We demonstrate that the gap between static and dynamic variants of the indexing problem can be almost closed. Our method is based on a novel framework for adding dynamism to static compressed data structures. Our framework also applies more generally to dynamizing other problems. We show, for example, how our framework can be applied to develop compressed representations of dynamic graphs and binary relations

    Efficient LZ78 factorization of grammar compressed text

    Full text link
    We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size nn representing a text SS of length NN, our algorithm computes the LZ78 factorization of TT in O(nN+mlogN)O(n\sqrt{N}+m\log N) time and O(nN+m)O(n\sqrt{N}+m) space, where mm is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the nNn\sqrt{N} term in the time and space complexities becomes either nLnL, where LL is the length of the longest LZ78 factor, or (Nα)(N - \alpha) where α0\alpha \geq 0 is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of SS of a certain length. Since m=O(N/logσN)m = O(N/\log_\sigma N) where σ\sigma is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when σ\sigma is constant, and can be more efficient when the text is compressible, i.e. when mm and nn are small.Comment: SPIRE 201

    New Algorithms for Position Heaps

    Full text link
    We present several results about position heaps, a relatively new alternative to suffix trees and suffix arrays. First, we show that, if we limit the maximum length of patterns to be sought, then we can also limit the height of the heap and reduce the worst-case cost of insertions and deletions. Second, we show how to build a position heap in linear time independent of the size of the alphabet. Third, we show how to augment a position heap such that it supports access to the corresponding suffix array, and vice versa. Fourth, we introduce a variant of a position heap that can be simulated efficiently by a compressed suffix array with a linear number of extra bits

    Fully dynamic data structure for LCE queries in compressed space

    Get PDF
    A Longest Common Extension (LCE) query on a text TT of length NN asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding G\mathcal{G} of size w=O(min(zlogNlogM,N))w = O(\min(z \log N \log^* M, N)) [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of TT, which can be seen as a compressed representation of TT, has a capability to support LCE queries in O(logN+loglogM)O(\log N + \log \ell \log^* M) time, where \ell is the answer to the query, zz is the size of the Lempel-Ziv77 (LZ77) factorization of TT, and M4NM \geq 4N is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, G\mathcal{G} can be enhanced to support efficient update operations: After processing G\mathcal{G} in O(wfA)O(w f_{\mathcal{A}}) time, we can insert/delete any (sub)string of length yy into/from an arbitrary position of TT in O((y+logNlogM)fA)O((y+ \log N\log^* M) f_{\mathcal{A}}) time, where fA=O(min{loglogMloglogwlogloglogM,logwloglogw})f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \}). This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct G\mathcal{G} in O(NfA)O(N f_{\mathcal{A}}) time from uncompressed string TT; in O(nloglognlogNlogM)O(n \log\log n \log N \log^* M) time from grammar-compressed string TT represented by a straight-line program of size nn; and in O(zfAlogNlogM)O(z f_{\mathcal{A}} \log N \log^* M) time from LZ77-compressed string TT with zz factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.Comment: arXiv admin note: text overlap with arXiv:1504.0695

    Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

    Get PDF
    Given a static reference string RR and a source string SS, a relative compression of SS with respect to RR is an encoding of SS as a sequence of references to substrings of RR. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string SS is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem

    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    Full text link
    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence
    corecore