Universal Lossless Compression with Unknown Alphabets - The Average Case
Universal compression of patterns of sequences generated by independently
identically distributed (i.i.d.) sources with unknown, possibly large,
alphabets is investigated. A pattern is a sequence of indices that contains all
consecutive indices in increasing order of first occurrence. If the alphabet of
a source that generated a sequence is unknown, the inevitable cost of coding
the unknown alphabet symbols can be exploited to create the pattern of the
sequence. This pattern can in turn be compressed by itself. It is shown that if
the alphabet size is essentially small, then the average minimax and
maximin redundancies as well as the redundancy of every code for almost every
source, when compressing a pattern, consist of at least 0.5 log(n/k^3) bits per
each unknown probability parameter, and if all alphabet letters are likely to
occur, there exist codes whose redundancy is at most 0.5 log(n/k^2) bits per
each unknown probability parameter, where n is the length of the data
sequences. Otherwise, if the alphabet is large, these redundancies are
essentially at least O(n^{-2/3}) bits per symbol, and there exist codes that
achieve redundancy of essentially O(n^{-1/2}) bits per symbol. Two sub-optimal
low-complexity sequential algorithms for compression of patterns are presented
and their description lengths analyzed, also pointing out that the pattern
average universal description length can decrease below the underlying i.i.d.
entropy for large enough alphabets.

Comment: Revised for IEEE Transactions on Information Theory
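As an illustration of the pattern construction described above, here is a minimal sketch (not from the paper) that maps a sequence to its pattern by replacing each symbol with the index of its first occurrence:

```python
def pattern(seq):
    """Map a sequence to its pattern: each symbol is replaced by the
    index of its first occurrence (1 for the first distinct symbol,
    2 for the second, and so on)."""
    index = {}          # symbol -> index of first occurrence
    out = []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1
        out.append(index[s])
    return out

# "abracadabra" has pattern 1 2 3 1 4 1 5 1 2 3 1
print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

The pattern discards the identities of the alphabet symbols, which is exactly the part whose coding cost is unavoidable when the alphabet is unknown.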
Universal Compression of Power-Law Distributions
English words and the outputs of many other natural processes are well-known
to follow a Zipf distribution. Yet this thoroughly-established property has
never been shown to help compress or predict these important processes. We show
that the expected redundancy of Zipf distributions of order α > 1 is
roughly the 1/α power of the expected redundancy of unrestricted
distributions. Hence for these orders, Zipf distributions can be better
compressed and predicted than was previously known. Unlike the expected case,
we show that worst-case redundancy is roughly the same for Zipf and for
unrestricted distributions. Hence Zipf distributions have significantly
different worst-case and expected redundancies, making them the first natural
distribution class shown to have such a difference.

Comment: 20 pages
Lower Bounds on the Redundancy of Huffman Codes with Known and Unknown Probabilities
In this paper we provide a method to obtain tight lower bounds on the minimum
redundancy achievable by a Huffman code when the probability distribution
underlying an alphabet is only partially known. In particular, we address the
case where the occurrence probabilities are unknown for some of the symbols in
an alphabet. Bounds can be obtained for alphabets of a given size, for
alphabets of up to a given size, and for alphabets of arbitrary size. The
method operates on a Computer Algebra System, yielding closed-form expressions for
all results. Finally, we show the potential of the proposed method to shed some
light on the structure of the minimum redundancy achievable by the Huffman
code.
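To make the notion of Huffman redundancy concrete: it is the gap between the expected Huffman codeword length and the source entropy. A minimal sketch for a fully known distribution (an illustrative computation, not the paper's Computer Algebra System method):

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given distribution."""
    # heap items: (probability, tie-breaking counter, symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1          # every merge adds one bit to the subtree
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def redundancy(probs):
    """Expected Huffman codeword length minus the source entropy, in bits."""
    lengths = huffman_lengths(probs)
    avg = sum(p * l for p, l in zip(probs, lengths))
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return avg - h

# dyadic distribution: the Huffman code matches the entropy exactly
print(redundancy([0.5, 0.25, 0.125, 0.125]))  # prints 0.0
```

The paper's contribution is bounding this quantity from below when some of the probabilities are unknown, which the sketch above cannot do directly.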
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
This paper deals with the problem of universal lossless coding on a countable
infinite alphabet. It focuses on some classes of sources defined by an envelope
condition on the marginal distribution, namely exponentially decreasing
envelope classes with exponent α. The minimax redundancy of
exponentially decreasing envelope classes is proved to be equivalent to
(1 / (4α log e)) log² n. Then a coding strategy is proposed, with
a Bayes redundancy equivalent to the maximin redundancy. Finally, an adaptive
algorithm is provided, whose redundancy is equivalent to the minimax redundancy.