2,739 research outputs found

    Dynamic Data Structures for Document Collections and Graphs

    Full text link
    In the dynamic indexing problem, we must maintain a changing collection of text documents so that we can efficiently support insertions, deletions, and pattern matching queries. We are especially interested in developing efficient data structures that store and query the documents in compressed form. All previous compressed solutions to this problem rely on answering rank and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. In this paper we show how this lower bound can be circumvented using our new framework. We demonstrate that the gap between static and dynamic variants of the indexing problem can be almost closed. Our method is based on a novel framework for adding dynamism to static compressed data structures. Our framework also applies more generally to dynamizing other problems. We show, for example, how our framework can be applied to develop compressed representations of dynamic graphs and binary relations

    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    Full text link
    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence

    Compressed Text Indexes:From Theory to Practice!

    Full text link
    A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Succinct Representations of Dynamic Strings

    Full text link
    The rank and select operations over a string of length n from an alphabet of size σ\sigma have been used widely in the design of succinct data structures. In many applications, the string itself need be maintained dynamically, allowing characters of the string to be inserted and deleted. Under the word RAM model with word size w=Ω(lgn)w=\Omega(\lg n), we design a succinct representation of dynamic strings using nH0+o(n)lgσ+O(w)nH_0 + o(n)\lg\sigma + O(w) bits to support rank, select, insert and delete in O(lgnlglgn(lgσlglgn+1))O(\frac{\lg n}{\lg\lg n}(\frac{\lg \sigma}{\lg\lg n}+1)) time. When the alphabet size is small, i.e. when \sigma = O(\polylog (n)), including the case in which the string is a bit vector, these operations are supported in O(lgnlglgn)O(\frac{\lg n}{\lg\lg n}) time. Our data structures are more efficient than previous results on the same problem, and we have applied them to improve results on the design and construction of space-efficient text indexes

    Compressed Data Structures for Dynamic Sequences

    Full text link
    We consider the problem of storing a dynamic string SS over an alphabet Σ={1,,σ}\Sigma=\{\,1,\ldots,\sigma\,\} in compressed form. Our representation supports insertions and deletions of symbols and answers three fundamental queries: access(i,S)\mathrm{access}(i,S) returns the ii-th symbol in SS, ranka(i,S)\mathrm{rank}_a(i,S) counts how many times a symbol aa occurs among the first ii positions in SS, and selecta(i,S)\mathrm{select}_a(i,S) finds the position where a symbol aa occurs for the ii-th time. We present the first fully-dynamic data structure for arbitrarily large alphabets that achieves optimal query times for all three operations and supports updates with worst-case time guarantees. Ours is also the first fully-dynamic data structure that needs only nHk+o(nlogσ)nH_k+o(n\log\sigma) bits, where HkH_k is the kk-th order entropy and nn is the string length. Moreover our representation supports extraction of a substring S[i..i+]S[i..i+\ell] in optimal O(logn/loglogn+/logσn)O(\log n/\log\log n + \ell/\log_{\sigma}n) time
    corecore