193 research outputs found

Faster Compact On-Line Lempel-Ziv Factorization

Author: Bannai Hideo
I Tomohiro
Inenaga Shunsuke
Takeda Masayuki
Yamamoto Jun'ichi
Publication date: 26/05/2013

We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\log\sigma)$ bits of working space, where $N$ is the length of the string and $\sigma$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3 N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2 N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.
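For orientation, the object being computed can be illustrated with a brute-force sketch. This is not the DAWG-based algorithm of the abstract. It is a naive baseline that assumes the common definition in which each factor is the longest prefix of the remaining suffix that also starts at an earlier position (overlaps allowed), or a single fresh letter. The function name `lz_factorize` is a hypothetical name chosen for this illustration.

```python
def lz_factorize(s: str) -> list[str]:
    """Naive Lempel-Ziv factorization (illustration only, far from O(N log N))."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        longest = 0
        # Try every earlier start j < i; the match may run past position i.
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        # A letter with no earlier occurrence forms a factor of length 1.
        step = max(longest, 1)
        factors.append(s[i:i + step])
        i += step
    return factors


if __name__ == "__main__":
    print(lz_factorize("abababab"))
```

The sketch processes the string left to right and never looks at letters beyond the current factor, so it is on-line in spirit, but it rescans the whole prefix for every factor.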

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Lempel-Ziv Factorization May Be Harder Than Computing All Runs

Author: Kosolobov Dmitry
Publication date: 19/09/2014

The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over an ordered alphabet. It is known that both of these problems can be solved by RAM algorithms in $O(n\log\sigma)$ time, where $n$ is the length of the input string and $\sigma$ is the number of distinct letters in it. We prove an $\Omega(n\log\sigma)$ lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computing runs using the Lempel-Ziv factorization cannot achieve an $o(n\log\sigma)$ time bound. In contrast with this, we exhibit an $O(n)$ decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.
Comment: 12 pages, 3 figures, submitte
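To make the notion of a run concrete, here is a brute-force sketch. It is neither the $O(n)$ decision tree algorithm nor an Lempel-Ziv based method from the abstract. It assumes the usual definition: a run is a substring with smallest period $p$ and length at least $2p$ that cannot be extended to the left or right with the same period. The function name `all_runs` and the `(start, end_exclusive, period)` output format are choices made for this illustration.

```python
def all_runs(s: str) -> list[tuple[int, int, int]]:
    """Return all runs of s as (start, end_exclusive, period), by brute force."""
    n = len(s)
    best: dict[tuple[int, int], int] = {}  # interval -> smallest period found
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            # Extend while the letters at distance p agree.
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k + p  # s[start:end] has period p and is maximal for p
            if end - start >= 2 * p:
                # Periods are tried in increasing order, so the first one
                # recorded for an interval is its smallest period.
                best.setdefault((start, end), p)
    return sorted((i, j, p) for (i, j), p in best.items())


if __name__ == "__main__":
    print(all_runs("aabaab"))
```

The sketch takes quadratic time in the worst case, which is enough to check small examples by hand.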

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

A Grammar Compression Algorithm based on Induced Suffix Sorting

Author: Ayala-Rincón Mauricio
Gog Simon
Louza Felipe A.
Navarro Gonzalo
Nunes Daniel Saad Nogueira
Publication date: 08/11/2017

We introduce GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution builds on the factorization performed by SAIS during suffix sorting. We construct a context-free grammar on the input string, which can be further reduced into a shorter string by substituting each substring with its corresponding factor. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between suffix rules, which are sorted according to the SAIS framework. When compared to well-known compression tools such as Re-Pair and 7-zip, our algorithm is competitive and very effective at handling repetitive strings with respect to compression ratio and compression and decompression running time.

arXiv.org e-Print Archive

Crossref

Repositorio Académico de la Universidad de Chile

Fast online Lempel-Ziv factorization in compressed space

Author: Policriti Alberto
Prezza Nicola
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Let $T$ be a text of length $n$ on an alphabet $\Sigma$ of size $\sigma$, and let $H_0$ be the zero-order empirical entropy of $T$. We show that the LZ77 factorization of $T$ can be computed in $nH_0 + o(n\log\sigma) + O(\sigma\log n)$ bits of working space with an online algorithm running in $O(n\log n)$ time. Previous space-efficient online solutions either work in compact space and $O(n\log n)$ time, or in succinct space and $O(n\log^3 n)$ time.
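The leading term of the space bound depends on the zero-order empirical entropy. As a small sketch, assuming the standard definition $H_0 = \sum_c \frac{n_c}{n}\log_2\frac{n}{n_c}$ with $n_c$ the number of occurrences of letter $c$, it can be computed as follows. The function name `zero_order_entropy` is a hypothetical name for this illustration and is not taken from the paper.

```python
from collections import Counter
from math import log2


def zero_order_entropy(text: str) -> float:
    """Zero-order empirical entropy of text, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)  # n_c for every distinct letter c
    return sum((c / n) * log2(n / c) for c in counts.values())


if __name__ == "__main__":
    t = "abracadabra"
    # n * H0 is the leading term of the working-space bound, in bits.
    print(len(t) * zero_order_entropy(t))
```

For a text over a single letter the value is 0, and for a text in which all $\sigma$ letters are equally frequent it is $\log_2\sigma$.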

Archivio istituzionale della ricerca - Università degli Studi di Udine

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Universal Compressed Text Indexing

Author: Navarro Gonzalo
Prezza Nicola
Publication date: 06/09/2018

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
Comment: Fixed with reviewer's comment
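The string attractor property mentioned in the abstract can be checked by brute force on small inputs. The sketch below assumes the definition that a set of positions is an attractor when every distinct substring has at least one occurrence that covers one of those positions. It uses 0-based positions, the function name `is_string_attractor` is hypothetical, and the code has nothing to do with the index construction itself.

```python
def is_string_attractor(text: str, positions: set[int]) -> bool:
    """Check the attractor property by enumerating substrings (small inputs only)."""
    n = len(text)
    covered = set()
    for p in positions:
        # Every occurrence text[i:j] with i <= p < j covers position p.
        for i in range(p + 1):
            for j in range(p + 1, n + 1):
                covered.add(text[i:j])
    all_substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    return all_substrings <= covered


if __name__ == "__main__":
    print(is_string_attractor("abab", {1, 2}))
```

Any attractor must contain at least one position for every distinct letter, which gives a quick sanity check on candidate sets.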

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Repositorio Académico de la Universidad de Chile

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma