Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let γ be the size of a
string attractor for a text of length n. Our index takes O(γ log(n/γ))
words of space and supports locating the occ occurrences of any pattern
of length m in O(m log n + occ log^ε n) time, for any constant ε > 0.
This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
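To make the attractor property concrete: a set Γ of positions of a text T is a string attractor if every distinct substring of T has at least one occurrence spanning some position of Γ. A brute-force verifier of this definition (an illustrative sketch only, not the paper's data structure; the function name is ours):

```python
def is_attractor(text: str, gamma: set) -> bool:
    """Check whether the 0-based position set `gamma` is a string attractor
    of `text`: every distinct substring must have at least one occurrence
    that covers some position in `gamma`. Brute force, for intuition only."""
    n = len(text)
    for length in range(1, n + 1):
        for sub in {text[i:i + length] for i in range(n - length + 1)}:
            covered = False
            for i in range(n - length + 1):  # scan all occurrences of sub
                if text[i:i + length] == sub and any(i <= g < i + length for g in gamma):
                    covered = True
                    break
            if not covered:
                return False
    return True
```

For example, for T = "abab" the set {1, 2} is an attractor, while {0} is not: no occurrence of the substring "b" covers position 0.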
Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic
Countless variants of the Lempel-Ziv compression are widely used in many
real-life applications. This paper is concerned with a natural modification of
the classical pattern matching problem inspired by the popularity of such
compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv
representation of a string t[1..N], does s occur in t? Farach and Thorup gave a
randomized O(n log^2(N/n) + m) time solution for this problem, where n is the size
of the compressed representation of t. We improve their result by developing a
faster and fully deterministic O(n log(N/n) + m) time algorithm with the same
space complexity. Note that for highly compressible texts, log(N/n) might be of
order n, so for such inputs the improvement is very significant. A (tiny)
fragment of our method can be used to give an asymptotically optimal solution
for the substring hashing problem considered by Farach and Muthukrishnan.
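For intuition about the compressed input, here is a minimal greedy Lempel-Ziv-style factorization together with its decoder (a quadratic-time sketch of the representation only, with names of our choosing; the paper's pattern-matching algorithm never decompresses t):

```python
def lz77_parse(t: str):
    """Greedy LZ factorization: each phrase is the longest prefix of the
    remaining suffix that also starts at an earlier position, plus one
    fresh character. Returns (source, length, next_char) triples."""
    phrases, i, n = [], 0, len(t)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):                       # candidate earlier start
            l = 0
            while i + l < n and t[j + l] == t[i + l]:
                l += 1                           # overlapping copies allowed
            if l > best_len:
                best_len, best_src = l, j
        nxt = t[i + best_len] if i + best_len < n else ''
        phrases.append((best_src, best_len, nxt))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    """Rebuild the text, copying left-to-right so self-overlaps work."""
    out = []
    for src, length, nxt in phrases:
        for k in range(length):
            out.append(out[src + k])
        if nxt:
            out.append(nxt)
    return ''.join(out)
```

A highly repetitive text yields few phrases: "ababab" parses into 3 phrases rather than 6 symbols, and n can be exponentially smaller than N.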
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced a
growth that outpaces Moore's Law and challenges our ability to handle them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field.
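The survey's point that statistical compression is blind to repetitiveness can be seen directly: symbol frequencies, and hence the order-0 entropy lower bound, are unchanged when a text is concatenated with itself, even though a copy-based scheme describes the second half in constant space. A small demonstration (function name is ours):

```python
import math
from collections import Counter

def order0_entropy(t: str) -> float:
    """Empirical order-0 entropy in bits per symbol: the lower bound for
    any compressor that codes symbols by their frequencies alone."""
    n = len(t)
    return -sum((c / n) * math.log2(c / n) for c in Counter(t).values())

t = "abracadabra"
# Doubling the text doubles its plain size, but the relative symbol
# frequencies are identical, so the per-symbol statistical bound does
# not move: the repetition is invisible to order-0 entropy.
assert abs(order0_entropy(t) - order0_entropy(t * 2)) < 1e-9
```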
Towards a Definitive Compressibility Measure for Repetitive Sequences
Unlike in statistical compression, where Shannon's entropy is a definitive
lower bound, no such clear measure exists for the compressibility of repetitive
sequences. Since statistical entropy does not capture repetitiveness, ad-hoc
measures like the size z of the Lempel-Ziv parse are frequently used to
estimate it. The size b of the smallest bidirectional macro scheme
captures better what can be achieved via copy-paste processes, though it is
NP-complete to compute and it is not monotonic upon symbol appends. Recently, a
more principled measure, the size γ of the smallest string
attractor, was introduced. The measure γ lower bounds all
the previous relevant ones, yet length-n strings can be represented and
efficiently indexed within space O(γ log(n/γ)), which also
upper bounds most measures. While γ is certainly a better measure of
repetitiveness than b, it is also NP-complete to compute and not monotonic,
and it is unknown if one can always represent a string in o(γ log n)
space.
In this paper, we study an even smaller measure, δ ≤ γ, which
can be computed in linear time, is monotonic, and allows encoding every string
in O(δ log(n/δ)) space because γ = O(δ log(n/δ)). We show that δ better captures the
compressibility of repetitive strings. Concretely, we show that (1) δ
can be strictly smaller than γ, by up to a logarithmic factor; (2) there
are string families needing Ω(δ log(n/δ)) space to be
encoded, so this space is optimal for every n and δ; (3) one can build
run-length context-free grammars of size O(δ log(n/δ)),
whereas the smallest (non-run-length) grammar can be up to log(n)/log log(n) times larger; and (4) within
O(δ log(n/δ)) space we can not only..
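The measure δ is the substring complexity, δ = max_k d_k/k, where d_k counts the distinct length-k substrings of the text. A direct quadratic-time evaluation of the definition (the linear-time computation mentioned above relies on suffix structures instead; this sketch is ours):

```python
def delta(t: str) -> float:
    """Substring complexity: delta = max over k of d_k / k, where d_k is
    the number of distinct substrings of t of length k. O(n^2) sketch."""
    n = len(t)
    return max(len({t[i:i + k] for i in range(n - k + 1)}) / k
               for k in range(1, n + 1))

# A unary string has exactly one distinct k-mer for every k, so its
# delta is 1, matching the intuition that "aaa...a" is maximally repetitive.
```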
Data comparison schemes for Pattern Recognition in Digital Images using Fractals
Pattern recognition in digital images is a common problem with application in
remote sensing, electron microscopy, medical imaging, seismic imaging and
astrophysics, for example. Although this subject has been researched for over
twenty years, there is still no general solution that can compare with the
human cognitive system, in which a pattern can be recognised subject to
arbitrary orientation and scale.
The application of Artificial Neural Networks can in principle provide a very
general solution, provided suitable training schemes are implemented.
However, this approach raises some major issues in practice. First, the CPU
time required to train an ANN for a grey level or colour image can be very
large especially if the object has a complex structure with no clear geometrical
features such as those that arise in remote sensing applications. Secondly,
both the core and file-space memory required to represent large images and
their associated data leads to a number of problems in which the use of
virtual memory is paramount.
The primary goal of this research has been to assess methods of image data
compression for pattern recognition using a range of different compression
methods. In particular, this research has resulted in the design and
implementation of a new algorithm for general pattern recognition based on
the use of fractal image compression.
This approach has for the first time allowed the pattern recognition problem to
be solved in a way that is invariant to rotation and scale. It allows both ANNs
and correlation to be used, subject to appropriate pre- and post-processing
techniques for digital image processing, an aspect for which a dedicated
programmer's workbench has been developed using X-Designer.