9 research outputs found

LZ-Compressed String Dictionaries

Author: Arz Julian
Fischer Johannes
Publication date: 03/05/2013

We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.

arXiv.org e-Print Archive

Crossref

Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

Author: A Poyias
D Arroyuelo
D Lemire
G Marsaglia
GH Gonnet
H Bannai
H Luan
J Fischer
J Jansson
J Kärkkäinen
J Ziv
JA Feldman
JG Cleary
K Chung
L Carter
P Tchebychev
RM Karp
RM Robinson
TA Welch
Y Nakashima
Publication date: 09/06/2017

We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar.

arXiv.org e-Print Archive

Crossref

Engineering a Textbook Approach to Index Massive String Dictionaries

Author: Ferragina Paolo
Rotundo Mariagiovanna
Vinciguerra Giorgio
Publication venue: Heidelberg
Publication date: 01/01/2023

We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings. Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs
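
The two-level idea in this abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: front coding stands in for the unspecified blockwise compression, a plain sorted list of block heads searched with binary search stands in for the succinct Patricia trie, and everything lives in memory rather than being split between internal and external memory. All function names (`build`, `decode_block`, `rank`, `contains`) are hypothetical.

```python
from bisect import bisect_right


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def build(sorted_strings, block_size):
    """Split a sorted string list into blocks.

    Returns (heads, blocks): heads[i] is the first string of block i
    (the part that would be indexed in internal memory), and blocks[i]
    holds the remaining strings of that block, front-coded as
    (shared_prefix_length, remaining_suffix) pairs.
    """
    heads, blocks = [], []
    for start in range(0, len(sorted_strings), block_size):
        chunk = sorted_strings[start:start + block_size]
        heads.append(chunk[0])
        encoded, prev = [], chunk[0]
        for s in chunk[1:]:
            k = lcp(prev, s)
            encoded.append((k, s[k:]))
            prev = s
        blocks.append(encoded)
    return heads, blocks


def decode_block(head, encoded):
    """Rebuild all strings of one block from its head and front-coded tail."""
    out = [head]
    for k, suffix in encoded:
        out.append(out[-1][:k] + suffix)
    return out


def rank(heads, blocks, block_size, query):
    """Number of dictionary strings that are <= query."""
    b = bisect_right(heads, query) - 1  # last block whose head is <= query
    if b < 0:
        return 0
    strings = decode_block(heads[b], blocks[b])
    return b * block_size + bisect_right(strings, query)


def contains(heads, blocks, block_size, query):
    """Membership test: locate the candidate block, then decode and scan it."""
    b = bisect_right(heads, query) - 1
    if b < 0:
        return False
    return query in decode_block(heads[b], blocks[b])


words = sorted(["car", "card", "care", "cat", "dog", "door", "zebra"])
heads, blocks = build(words, block_size=3)
print(heads)
print(rank(heads, blocks, 3, "cow"), contains(heads, blocks, 3, "door"))
```

Only `heads` is consulted to pick a block, so a single block is decoded per query; in the setting the abstract describes, that decode step would correspond to the one access to external memory.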

Archivio della Ricerca - Università di Pisa

Archivio della ricerca della Scuola Superiore Sant'Anna

Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)

Author: Böhm Klemens
Schäler Martin
Willkomm Jens
Publication venue: Karlsruher Institut für Technologie
Publication date: 16/01/2021

Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose respective pruning techniques. Our approaches remove characters with low information value. The various variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.

KITopen

Efficient String Dictionary Compression Using String Dictionaries

Author: Fuketa Masao
Kanda Shunsuke
Morita Kazuhiro
Publication venue: 日本データベース学会
Publication date: 09/11/2020

A string dictionary is a data structure to store a set of strings. Recently, instances have emerged in practice where the size of string dictionaries has become a critical problem in many applications. Consequently, compressed string dictionaries have been proposed by leveraging efficient implementation techniques, such as Trie and Front-Coding, and powerful text compression techniques, such as Re-Pair. In this paper, we propose new dictionary structures based on a strategy using string dictionaries for the compression in order to improve existing compressed ones. We show that our string dictionaries can be constructed in a shorter time compared to the Re-Pair versions with competitive space usage and operation speed, through experiments on real-world datasets.

Tokushima University Institutional Repository

Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)

Author: Maneth Sebastian
Navarro Gonzalo
Publication date: 01/01/2013

This report documents the program and the outcomes of Dagstuhl Seminar 13232 "Indexes and Computation over Compressed Structured Data".

Edinburgh Research Explorer

Dagstuhl Research Online Publication Server

Top Tree Compression of Tries

Author: Bille Philip
Gawrychowski Pawel
Landau Gad M.
Weimann Oren
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th International Symposium on Algorithms and Computation (ISAAC 2019)
Publication date: 01/01/2019

We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length n over an alphabet of size sigma into a compressed data structure of worst-case optimal size O(n/log_sigma n) that, given a pattern string P of length m, determines if P is a prefix of one of the strings in time O(min(m log sigma, m + log n)). We show that this query time is in fact optimal regardless of the size of the data structure. Existing solutions either use Omega(n) space or rely on word RAM techniques, such as tabulation, hashing, address arithmetic, or word-level parallelism, and hence do not work on a pointer machine. Our result is the first solution on a pointer machine that achieves worst-case o(n) space. Along the way, we develop several interesting data structures that work on a pointer machine and are of independent interest. These include an optimal data structure for random access to a grammar-compressed string and an optimal data structure for a variant of the level ancestor problem.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Online Research Database In Technology