9 research outputs found

    LZ-Compressed String Dictionaries

    Full text link
    We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive

    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    Full text link
    We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar

    Engineering a Textbook Approach to Index Massive String Dictionaries

    Get PDF
    We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings. Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs

    Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)

    Get PDF
    Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose respective pruning techniques. Our approaches remove characters with low information value. The various variants determine a character\u27s information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%

    Efficient String Dictionary Compression Using String Dictionaries

    Get PDF
    文字列集合を保管するためのデータ構造である文字列辞書に関して,近年,多くの用途でコンパクト性が求められるという実例が報告されている.また,その背景に応じて,Trie や Front-Coding などの辞書を実現するための優れた技法に,Re-Pair などの強力な文書圧縮技法を組み合わせた圧縮文字列辞書が提案されている.本稿では,既存の圧縮文字列辞書の改良を目的とし,文字列辞書の圧縮に文字列辞書を用いるという方策に基づいた辞書構造を提案する.実データを用いた実験より,提案による文字列辞書はRe-Pair により圧縮した辞書と比べ,メモリ効率や検索・復元速度のトレードオフに関して同等の性能を示しつつ,短い時間で構築できることを示した.A string dictionary is a data structure to store a set of strings. Recently, instances have emerged in practice where the size of string dictionaries has become a critical problem in many applications. Consequently, compressed string dictionaries have been proposed by leveraging efficient implementation techniques, such as Trie and Front-Coding, and powerful text compression techniques, such as Re-Pair. In this paper, we propose new dictionary structures based on a strategy using string dictionaries for the compression in order to improve existing compressed ones. We show that our string dictionaries can be constructed in a shorter time compared to the Re-Pair versions with competitive space usage and operation speed, through experiments on real-world datasets

    Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)

    Get PDF
    This report documents the program and the outcomes of Dagstuhl Seminar 13232 "Indexes and Computation over Compressed Structured Data"

    Top Tree Compression of Tries

    Get PDF
    We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length n over an alphabet of size sigma into a compressed data structure of worst-case optimal size O(n/log_sigma n) that given a pattern string P of length m determines if P is a prefix of one of the strings in time O(min(m log sigma,m + log n)). We show that this query time is in fact optimal regardless of the size of the data structure. Existing solutions either use Omega(n) space or rely on word RAM techniques, such as tabulation, hashing, address arithmetic, or word-level parallelism, and hence do not work on a pointer machine. Our result is the first solution on a pointer machine that achieves worst-case o(n) space. Along the way, we develop several interesting data structures that work on a pointer machine and are of independent interest. These include an optimal data structures for random access to a grammar-compressed string and an optimal data structure for a variant of the level ancestor problem

    Querying and Efficiently Searching Large, Temporal Text Corpora

    Get PDF
    corecore