409 research outputs found

    c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

    Full text link
    Given a dynamic set KK of kk strings of total length nn whose characters are drawn from an alphabet of size σ\sigma, a keyword dictionary is a data structure built on KK that provides locate, prefix search, and update operations on KK. Under the assumption that α=w/lgσ\alpha = w / \lg \sigma characters fit into a single machine word ww, we propose a keyword dictionary that represents KK in nlgσ+Θ(klgn)n \lg \sigma + \Theta(k \lg n) bits of space, supporting all operations in O(m/α+lgα)O(m / \alpha + \lg \alpha) expected time on an input string of length mm in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations

    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    Full text link
    We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar

    Engineering Fully-Compressed Suffix Trees

    Get PDF

    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    Full text link
    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence

    MergedTrie: Efficient textual indexing

    Get PDF
    The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed.This study has been partially funded by the SEQUOIA-UA (TIN2015-63502-C3-3-R) and the RESCATA (TIN2015-65100-R) projects of the Spanish Ministry of Economy and Competitiveness (MINECO)
    corecore