Abstract

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γ log n)-space representation of strings is known. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists: for every length n and every value 2 ≤ δ ≤ n^(1−ε), we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), for strings of any length n, the smallest context-free grammar can be of size Ω(δ log² n / log log n). No such separation is known for γ.
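
The abstract does not define δ; assuming the standard definition from this line of work, δ(S) = max over k ≥ 1 of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S. The following is a minimal illustrative sketch of that definition, not the authors' linear-time algorithm (which relies on suffix structures); the function name compute_delta and the use of Python sets are illustrative choices only.

```python
# Illustrative computation of the repetitiveness measure delta, assuming
#   delta(S) = max over k >= 1 of d_k(S) / k,
# where d_k(S) is the number of distinct substrings of S of length k.
# This naive version is far slower than the linear time claimed in the
# paper (slicing and hashing make it superquadratic); it only serves to
# make the definition concrete.

def compute_delta(s: str) -> float:
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Count distinct k-mers of s.
        distinct_k = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct_k / k)
    return best

if __name__ == "__main__":
    # A highly repetitive string has small delta relative to its length.
    print(compute_delta("abababababababab"))  # 2.0: only two distinct 1-mers dominate
    print(compute_delta("abcdefghijklmnop"))  # 16.0: all characters distinct
```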
