Empirical processes, typical sequences and coordinated actions in standard Borel spaces
This paper proposes a new notion of typical sequences on a wide class of
abstract alphabets (so-called standard Borel spaces), which is based on
approximations of memoryless sources by empirical distributions uniformly over
a class of measurable "test functions." In the finite-alphabet case, we can
take all uniformly bounded functions and recover the usual notion of strong
typicality (or typicality under the total variation distance). For a general
alphabet, however, this function class turns out to be too large, and must be
restricted. With this in mind, we define typicality with respect to any
Glivenko-Cantelli function class (i.e., a function class that admits a Uniform
Law of Large Numbers) and demonstrate its power by giving simple derivations of
the fundamental limits on the achievable rates in several source coding
scenarios, in which the relevant operational criteria pertain to reproducing
empirical averages of a general-alphabet stationary memoryless source with
respect to a suitable function class. Comment: 14 pages, 3 pdf figures; accepted to IEEE Transactions on Information
Theory
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
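For readers unfamiliar with the transform named in the abstract above, the following is a minimal textbook sketch of the BWT and its inverse, not the constructions analyzed in the paper. It assumes an end-of-string sentinel character (`$` here, a choice made only for this illustration) that does not occur in the input; the function names `bwt` and `inverse_bwt` are hypothetical. Both routines use the naive sort-all-rotations approach for clarity rather than efficiency.

```python
def bwt(text, sentinel="$"):
    """Forward BWT: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Build every cyclic rotation and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Inverse BWT by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the known last column to each row, then sort the rows.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the rotation that ends with the sentinel.
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in input")


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)               # annb$aa
    print(inverse_bwt(encoded))  # banana
```

The output groups equal symbols that share a following context (the run `nn` and the trailing `aa` in the example), which is the ordering a subsequent lossless code attempts to exploit.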
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
This paper deals with the problem of universal lossless coding on a countable
infinite alphabet. It focuses on some classes of sources defined by an envelope
condition on the marginal distribution, namely exponentially decreasing
envelope classes with exponent . The minimax redundancy of
exponentially decreasing envelope classes is proved to be equivalent to
. Then a coding strategy is proposed, with
a Bayes redundancy equivalent to the maximin redundancy. Finally, an adaptive
algorithm is provided, whose redundancy is equivalent to the minimax redundancy.
About Adaptive Coding on Countable Alphabets: Max-Stable Envelope Classes
In this paper, we study the problem of lossless universal source coding for
stationary memoryless sources on countably infinite alphabets. This task is
generally not achievable without restricting the class of sources over which
universality is desired. Building on our prior work, we propose natural
families of sources characterized by a common dominating envelope. We
particularly emphasize the notion of adaptivity, which is the ability to
perform as well as an oracle knowing the envelope, without actually knowing it.
This is closely related to the notion of hierarchical universal source coding,
but with the important difference that families of envelope classes are not
discretely indexed and not necessarily nested.
Our contribution is to extend the classes of envelopes over which adaptive
universal source coding is possible, namely by including max-stable
(heavy-tailed) envelopes which are excellent models in many applications, such
as natural language modeling. We derive a minimax lower bound on the redundancy
of any code on such envelope classes, including an oracle that knows the
envelope. We then propose a constructive code that does not use knowledge of
the envelope. The code is computationally efficient and is structured to use an
Expanding Threshold for Auto-Censoring, and we therefore dub it the
ETAC-code. We prove that the ETAC-code achieves the lower
bound on the minimax redundancy within a factor logarithmic in the sequence
length, and can therefore be qualified as a near-adaptive code over families of
heavy-tailed envelopes. For finite and light-tailed envelopes the penalty is
even less, and the same code follows closely previous results that explicitly
made the light-tailed assumption. Our technical results are founded on methods
from regular variation theory and concentration of measure.
Estimation of the Rate-Distortion Function
Motivated by questions in lossy data compression and by theoretical
considerations, we examine the problem of estimating the rate-distortion
function of an unknown (not necessarily discrete-valued) source from empirical
data. Our focus is the behavior of the so-called "plug-in" estimator, which is
simply the rate-distortion function of the empirical distribution of the
observed data. Sufficient conditions are given for its consistency, and
examples are provided to demonstrate that in certain cases it fails to converge
to the true rate-distortion function. The analysis of its performance is
complicated by the fact that the rate-distortion function is not continuous in
the source distribution; the underlying mathematical problem is closely related
to the classical problem of establishing the consistency of maximum likelihood
estimators. General consistency results are given for the plug-in estimator
applied to a broad class of sources, including all stationary and ergodic ones.
A more general class of estimation problems is also considered, arising in the
context of lossy data compression when the allowed class of coding
distributions is restricted; analogous results are developed for the plug-in
estimator in that case. Finally, consistency theorems are formulated for
modified (e.g., penalized) versions of the plug-in, and for estimating the
optimal reproduction distribution. Comment: 18 pages, no figures [v2: removed an example with an error; corrected
typos; a shortened version will appear in IEEE Trans. Inform. Theory]
Universal Densities Exist for Every Finite Reference Measure
As is known, universal codes, which estimate the entropy rate
consistently, exist for stationary ergodic sources over finite alphabets but
not over countably infinite ones. We generalize universal coding as the problem
of universal densities with respect to a fixed reference measure on a countably
generated measurable space. We show that universal densities, which estimate
the differential entropy rate consistently, exist for finite reference
measures. Thus finite alphabets are not necessary in some sense. To exhibit a
universal density, we adapt the non-parametric differential (NPD) entropy rate
estimator by Feutrill and Roughan. Our modification is analogous to Ryabko's
modification of prediction by partial matching (PPM) by Cleary and Witten.
Whereas Ryabko considered a mixture over Markov orders, we consider a mixture
over quantization levels. Moreover, we demonstrate that any universal density
induces a strongly consistent Cesàro mean estimator of conditional density
given an infinite past. This yields a universal predictor with the loss
for a countable alphabet. Finally, we specialize universal densities to
processes over natural numbers and on the real line. We derive sufficient
conditions for consistent estimation of the entropy rate with respect to
infinite reference measures in these domains. Comment: 28 pages, no figures