168 research outputs found

Lightweight Lempel-Ziv Parsing

Author: D. Okanohara
E. Ohlebusch
G. Chen
G. Navarro
J. Barbay
J. Fischer
J. Kärkkäinen
J. Ziv
M. Crochemore
M.I. Abouelhoda
P. Ferragina
R. Cánovas
S. Kreft
S. Kuruppu
T. Gagie
T. Kasai
T. Starikovskaya
U. Manber
W.I. Chang
Publication date: 01/01/2013

We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As part of the algorithm, we describe new methods for computing matching statistics, which may be of independent interest. Comment: 12 pages
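
For readers unfamiliar with the term, the following is a minimal sketch of what an LZ77 factorization is. It is a naive quadratic-time, linear-space baseline and not the O(n/d)-space algorithm of the abstract. The function names `lz77_factorize` and `lz77_decode` and the phrase encoding (a literal character, or a `(source, length)` pair with 0-based source) are assumptions made for illustration.

```python
def lz77_factorize(text):
    """Naive LZ77 factorization (illustrative baseline, O(n^2) time or worse).

    Each phrase is either a literal character (no earlier occurrence exists)
    or a pair (source, length) copying `length` characters that start at the
    earlier 0-based position `source`. Source and target may overlap.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position and keep the longest match.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append(text[i])  # first occurrence of this character
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Invert lz77_factorize by copying character by character."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            src, length = phrase
            for k in range(length):
                out.append(out[src + k])  # works for overlapping copies
    return "".join(out)
```

For example, `lz77_factorize("aaaa")` yields one literal followed by a single self-overlapping copy, which is why highly repetitive inputs produce few phrases.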

arXiv.org e-Print Archive

Crossref

Lempel-Ziv Parsing in External Memory

Author: Kempa Dominik
Kärkkäinen Juha
Puglisi Simon J.
Publication date: 04/07/2013

For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections. Comment: 10 pages

arXiv.org e-Print Archive

Crossref

Bidirectional Text Compression in External Memory

Author: Dinklage Patrick
Ellert Jonas
Fischer Johannes
Penschuck Manuel
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual European Symposium on Algorithms (ESA 2019)
Publication date: 01/01/2019

Bidirectional compression algorithms work by replacing repeated substrings with references that, unlike in the well-known LZ77 scheme, can point in either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Lempel-Ziv Compression in a Sliding Window

Author: Bille Philip
Cording Patrick Hagge
Fischer Johannes
Gørtz Inge Li
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017

Online Research Database In Technology

Lempel-Ziv Compression in a Sliding Window

Author: Bille Philip
Cording Patrick Hagge
Fischer Johannes
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)
Publication date: 01/01/2017

We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size. To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+ε)-approximation of the rightmost parsing (for any constant ε > 0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem.

Dagstuhl Research Online Publication Server

Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

Author: A Poyias
D Arroyuelo
D Lemire
G Marsaglia
GH Gonnet
H Bannai
H Luan
J Fischer
J Jansson
J Kärkkäinen
J Ziv
JA Feldman
JG Cleary
K Chung
L Carter
P Tchebychev
RM Karp
RM Robinson
TA Welch
Y Nakashima
Publication date: 09/06/2017

We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar.
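
To make the role of the trie concrete, here is a minimal sketch of LZ78 parsing in which the trie is stored as a hash map from `(parent node, character)` to child node. This dictionary-backed representation is only one simple option and is not one of the tuned tries compared in the study. The name `lz78_factorize` and the factor encoding `(referenced node id, next character)` are assumptions for illustration.

```python
def lz78_factorize(text):
    """LZ78 parsing with a hash-map trie (illustrative sketch).

    Node 0 is the root (the empty phrase). Every emitted factor creates one
    new trie node, numbered 1, 2, 3, ... in order of creation. A factor
    (node, ch) means: the phrase spelled by `node`, followed by `ch`.
    """
    trie = {}       # (parent node id, character) -> child node id
    factors = []
    node = 0        # current position in the trie while matching
    next_id = 1
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                 # keep extending the current match
        else:
            trie[(node, ch)] = next_id   # new phrase: insert a leaf
            next_id += 1
            factors.append((node, ch))
            node = 0                     # restart matching from the root
    if node != 0:
        # Input ended inside an existing phrase; emit it without a new char.
        factors.append((node, None))
    return factors
```

The inner loop performs exactly one trie lookup or insertion per input character, which is why the choice of trie representation dominates the running time and memory footprint in practice.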

arXiv.org e-Print Archive

Crossref

Indexing Highly Repetitive String Collections

Author: Navarro Gonzalo
Publication date: 13/12/2021

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

arXiv.org e-Print Archive

Computing Runs on a General Alphabet

Author: Kosolobov Dmitry
Publication date: 22/11/2015

We describe a RAM algorithm computing all runs (maximal repetitions) of a given string of length $n$ over a general ordered alphabet in $O(n\log^{2/3} n)$ time and linear space. Our algorithm outperforms all known solutions working in $\Theta(n\log\sigma)$ time provided $\sigma = n^{\Omega(1)}$, where $\sigma$ is the alphabet size. We conjecture that there exists a linear time RAM algorithm finding all runs. Comment: 4 pages, 2 figures
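
As an illustration of the objects being computed, the sketch below enumerates all runs by brute force in quadratic time. It uses only equality comparisons of characters and is not the algorithm of the abstract. The name `find_runs`, the 0-based inclusive indexing, and the output format `(start, end, period)` are assumptions for illustration.

```python
def find_runs(s):
    """Brute-force enumeration of runs (maximal repetitions), O(n^2) time.

    A run is reported as (start, end, period) with 0-based inclusive bounds:
    s[start..end] has smallest period `period`, has length at least
    2 * period, and cannot be extended left or right with that period.
    """
    n = len(s)
    runs = {}  # (start, end) -> smallest period found for that interval
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            # Extend while characters at distance p keep agreeing.
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k + p - 1  # last index of the p-periodic substring
            if end - start + 1 >= 2 * p:
                # Periods are tried in increasing order, so the first period
                # recorded for an interval is its smallest one.
                runs.setdefault((start, end), p)
    return sorted((a, b, p) for (a, b), p in runs.items())
```

For instance, in `"mississippi"` the substring `"ississi"` at positions 1 to 7 is a run with period 3, alongside the three short runs `"ss"`, `"ss"` and `"pp"` with period 1.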

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

String attractors and combinatorics on words

Author: Mantaci S.
Restivo A.
Romana G.
Rosone G.
Sciortino M.
Publication venue: CEUR-WS
Publication date: 01/01/2019

The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w[1]w[2]···w[n] is a subset Γ of the positions 1, ..., n, such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ. While finding the smallest string attractor for a word is an NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. In this paper we explore the notion of string attractor from a combinatorial point of view, by focusing on several families of finite words. The results presented in the paper suggest that the notion of string attractor can be used to define new tools to investigate combinatorial properties of words.
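
The definition above can be checked directly by brute force. The sketch below is a cubic-time verifier meant only to make the definition concrete; it does not search for a smallest attractor, which the abstract notes is NP-complete. The name `is_string_attractor` is an assumption, and positions are 0-based here, whereas the abstract numbers them from 1.

```python
def is_string_attractor(word, positions):
    """Check the string attractor property by brute force.

    `positions` is a collection of 0-based indices into `word`. The set is
    an attractor if every distinct non-empty factor (substring) of `word`
    has at least one occurrence word[i:j] that contains a marked position.
    """
    n = len(word)
    marked = set(positions)
    all_factors = set()
    covered = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            factor = word[i:j]
            all_factors.add(factor)
            # This occurrence spans indices i .. j-1.
            if any(i <= p < j for p in marked):
                covered.add(factor)
    return covered == all_factors
```

A quick consequence visible in the code: any attractor must contain at least one position for each distinct character, since a single-character factor is covered only by an occurrence at a marked position.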

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università di Palermo