Universal Compressed Text Indexing

Authors: Navarro Gonzalo; Prezza Nicola
Publication date: 06/09/2018

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
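The attractor property described in the abstract (a set of text positions such that every distinct substring has at least one occurrence crossing one of them) can be checked directly on small inputs. The following is a brute-force sketch, not the paper's index: the function name `is_string_attractor` and the 0-based position convention are assumptions made for illustration, and the running time is cubic in the text length.

```python
def is_string_attractor(text, positions):
    """Brute-force check of the string attractor property.

    `positions` holds 0-based text positions (an assumed convention).
    Returns True if every distinct substring of `text` has at least one
    occurrence text[i..j] that contains a position from `positions`.
    """
    n = len(text)
    marked = set(positions)
    all_substrings = set()
    covered = set()
    for i in range(n):
        crosses = False  # does text[i..j] contain a marked position yet?
        for j in range(i, n):
            crosses = crosses or (j in marked)
            sub = text[i:j + 1]
            all_substrings.add(sub)
            if crosses:
                covered.add(sub)
    return all_substrings == covered


if __name__ == "__main__":
    print(is_string_attractor("abab", {0, 1}))
    print(is_string_attractor("abab", {0}))
```

In "abab", the positions {0, 1} work because every substring has an occurrence touching one of them, whereas {0} alone fails because no occurrence of "b" covers position 0. An attractor therefore needs at least one position per distinct character.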

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Repositorio Académico de la Universidad de Chile

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

Author: Gawrychowski Pawel
Publication date: 01/01/2011

Countless variants of the Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(n log^2(N/n) + m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(n log(N/n) + m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan.
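To make the problem statement concrete, here is a naive baseline, not the algorithm from the paper: it fully decompresses the text and then searches it, so it costs time proportional to N rather than to the compressed size n. The phrase encoding (a literal character, or a `(source, length)` pair copying from earlier output, with self-overlap allowed) and the names `lz_decompress` and `occurs` are assumptions chosen for illustration.

```python
def lz_decompress(phrases):
    """Expand a simple LZ77-style phrase list into the full text.

    Each phrase is either a single literal character or a pair
    (source, length) that copies `length` characters starting at the
    0-based output position `source`. Copies may overlap their own output.
    """
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            source, length = phrase
            for k in range(length):
                # Read one character at a time so overlapping copies work.
                out.append(out[source + k])
    return "".join(out)


def occurs(pattern, phrases):
    """Naive test: decompress everything, then search the plain text."""
    return pattern in lz_decompress(phrases)


if __name__ == "__main__":
    phrases = ["a", "b", (0, 4)]  # three phrases encode "ababab"
    print(lz_decompress(phrases))
    print(occurs("bab", phrases), occurs("bb", phrases))
```

A two-phrase input such as `["a", (0, 10**6)]` shows why this baseline is unattractive: the representation has n = 2 phrases, yet the decompressed text has over a million characters that the naive search must materialize.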

arXiv.org e-Print Archive

CiteSeerX

Indexing Highly Repetitive String Collections

Author: Navarro Gonzalo
Publication date: 13/12/2021

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

arXiv.org e-Print Archive

Efficient Variable-to-Fixed-Length Coding Algorithms for Text Compression

Author: 吉田諭史
Publication date: 25/03/2014

Hokkaido University Collection of Scholarly and Academic Papers

Towards a Definitive Compressibility Measure for Repetitive Sequences

Authors: Kociumaka Tomasz; Navarro Gonzalo; Prezza Nicola
Publication date: 15/01/2021

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel-Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma \le b$ lower bounds all the previous relevant ones, yet length-$n$ strings can be represented and efficiently indexed within space $O(\gamma\log\frac{n}{\gamma})$, which also upper bounds most measures. While $\gamma$ is certainly a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in $o(\gamma\log n)$ space. In this paper, we study an even smaller measure, $\delta \le \gamma$, which can be computed in linear time, is monotonic, and allows encoding every string in $O(\delta\log\frac{n}{\delta})$ space because $z = O(\delta\log\frac{n}{\delta})$. We show that $\delta$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $\delta$ can be strictly smaller than $\gamma$, by up to a logarithmic factor; (2) there are string families needing $\Omega(\delta\log\frac{n}{\delta})$ space to be encoded, so this space is optimal for every $n$ and $\delta$; (3) one can build run-length context-free grammars of size $O(\delta\log\frac{n}{\delta})$, whereas the smallest (non-run-length) grammar can be up to $\Theta(\log n/\log\log n)$ times larger; and (4) within $O(\delta\log\frac{n}{\delta})$ space we can not only..
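The abstract does not spell out how $\delta$ is defined. Assuming the usual substring-complexity definition, $\delta = \max_{1 \le k \le n} d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of the text, a direct computation might look like the sketch below. It is a brute-force illustration that is far slower than the linear-time computation the abstract mentions, and the function name `delta` is an assumption.

```python
from fractions import Fraction


def delta(text):
    """Substring-complexity measure, computed by brute force.

    Assumes the definition delta = max over k of d_k / k, where d_k is the
    number of distinct substrings of length k. Returns an exact Fraction.
    """
    n = len(text)
    best = Fraction(0)
    for k in range(1, n + 1):
        # Set of all distinct substrings of length exactly k.
        d_k = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(d_k, k))
    return best


if __name__ == "__main__":
    for s in ["aaaa", "abab", "abcd"]:
        print(s, delta(s))
```

For the highly repetitive "aaaa" the measure is 1, while for "abcd", which has no repetition at all, it equals the text length 4. This matches the intuition that $\delta$ grows with the amount of non-repeated content.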

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

MPG.PuRe

Data comparison schemes for Pattern Recognition in Digital Images using Fractals

Author: Aburas Abdul Razag Ali
Publication venue: School of Computing Sciences, Department of Mathematical Sciences
Publication date: 01/07/1997

Pattern recognition in digital images is a common problem with applications in, for example, remote sensing, electron microscopy, medical imaging, seismic imaging and astrophysics. Although this subject has been researched for over twenty years, there is still no general solution which can be compared with the human cognitive system, in which a pattern can be recognised subject to arbitrary orientation and scale. The application of Artificial Neural Networks (ANNs) can in principle provide a very general solution, provided that suitable training schemes are implemented. However, this approach raises some major issues in practice. First, the CPU time required to train an ANN for a grey level or colour image can be very large, especially if the object has a complex structure with no clear geometrical features, such as those that arise in remote sensing applications. Secondly, both the core and file space memory required to represent large images and their associated data tasks lead to a number of problems in which the use of virtual memory is paramount. The primary goal of this research has been to assess methods of image data compression for pattern recognition using a range of different compression methods. In particular, this research has resulted in the design and implementation of a new algorithm for general pattern recognition based on the use of fractal image compression. This approach has for the first time allowed the pattern recognition problem to be solved in a way that is invariant to rotation and scale. It allows both ANNs and correlation to be used subject to appropriate pre- and post-processing techniques for digital image processing, an aspect for which a dedicated programmer's workbench has been developed using X-Designer.

De Montfort University Open Research Archive

At the Roots of Dictionary Compression: String Attractors

Author: A
Belazzougui D.
Bille P.
Nishimoto T.
Publication venue: ACM
Publication date: 01/01/2018

Peer reviewed

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Helsingin yliopiston digitaalinen arkisto

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology