5,728 research outputs found
IDENTIFICATION OF COVER SONGS USING INFORMATION THEORETIC MEASURES OF SIMILARITY
13 pages, 5 figures, 4 tables. v3: Accepted version13 pages, 5 figures, 4 tables. v3: Accepted version13 pages, 5 figures, 4 tables. v3: Accepted versio
Compressed Text Indexes:From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they missed a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory source
Pushdown Compression
The pressing need for eficient compression schemes for XML documents has
recently been focused on stack computation [6, 9], and in particular calls for
a formulation of information-lossless stack or pushdown compressors that allows
a formal analysis of their performance and a more ambitious use of the stack in
XML compression, where so far it is mainly connected to parsing mechanisms. In
this paper we introduce the model of pushdown compressor, based on pushdown
transducers that compute a single injective function while keeping the widest
generality regarding stack computation. The celebrated Lempel-Ziv algorithm
LZ78 [10] was introduced as a general purpose compression algorithm that
outperforms finite-state compressors on all sequences. We compare the
performance of the Lempel-Ziv algorithm with that of the pushdown compressors,
or compression algorithms that can be implemented with a pushdown transducer.
This comparison is made without any a priori assumption on the data's source
and considering the asymptotic compression ratio for infinite sequences. We
prove that Lempel-Ziv is incomparable with pushdown compressors
On Match Lengths, Zero Entropy and Large Deviations - with Application to Sliding Window Lempel-Ziv Algorithm
The Sliding Window Lempel-Ziv (SWLZ) algorithm that makes use of recurrence
times and match lengths has been studied from various perspectives in
information theory literature. In this paper, we undertake a finer study of
these quantities under two different scenarios, i) \emph{zero entropy} sources
that are characterized by strong long-term memory, and ii) the processes with
weak memory as described through various mixing conditions.
For zero entropy sources, a general statement on match length is obtained. It
is used in the proof of almost sure optimality of Fixed Shift Variant of
Lempel-Ziv (FSLZ) and SWLZ algorithms given in literature. Through an example
of stationary and ergodic processes generated by an irrational rotation we
establish that for a window of size , a compression ratio given by
where depends on and approaches 1 as
, is obtained under the application of FSLZ and SWLZ
algorithms. Also, we give a general expression for the compression ratio for a
class of stationary and ergodic processes with zero entropy.
Next, we extend the study of Ornstein and Weiss on the asymptotic behavior of
the \emph{normalized} version of recurrence times and establish the \emph{large
deviation property} (LDP) for a class of mixing processes. Also, an estimator
of entropy based on recurrence times is proposed for which large deviation
principle is proved for sources satisfying similar mixing conditions.Comment: accepted to appear in IEEE Transactions on Information Theor
- …