7 research outputs found
On vocabulary size of grammar-based codes
We discuss inequalities holding between the vocabulary size, i.e., the number
of distinct nonterminal symbols in a grammar-based compression for a string,
and the excess length of the respective universal code, i.e., the code-based
analog of algorithmic mutual information. The aim is to strengthen inequalities
which were discussed in a weaker form in linguistics but shed some light on
redundancy of efficiently computable codes. The main contribution of the paper
is a construction of universal grammar-based codes for which the excess lengths
can be bounded easily.
Comment: 5 pages, accepted to ISIT 2007 and corrected
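
To make the notion of vocabulary size concrete, the following Python toy infers a straight-line grammar in the Re-Pair style and reports the number of distinct nonterminals; the function repair_grammar and every design choice here are illustrative assumptions, not the universal code constructed in the paper.

    from collections import Counter

    def repair_grammar(text):
        """Toy Re-Pair-style grammar inference: repeatedly replace the most
        frequent adjacent pair of symbols with a fresh nonterminal."""
        seq, rules, next_id = list(text), {}, 0
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs or pairs.most_common(1)[0][1] < 2:
                break                                # no pair repeats; stop
            pair = pairs.most_common(1)[0][0]
            nt = f"N{next_id}"
            next_id += 1
            rules[nt] = pair
            out, i = [], 0                           # replace occurrences left to right
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    seq, rules = repair_grammar("abracadabra abracadabra")
    print("vocabulary size:", len(rules), "rules:", rules)
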
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf-Mandelbrot's law in
natural language which rests on two areas of information theory. Firstly, we
construct a new class of grammar-based codes and, secondly, we investigate
properties of strongly nonergodic stationary processes. The motivation for the
joint discussion is to prove a proposition with a simple informal statement: If
a text of length n describes n^β independent facts in a repetitive way,
then the text contains at least n^β/log n different words, under
suitable conditions. In the formal statement, two modeling postulates
are adopted. Firstly, the words are understood as nonterminal symbols of the
shortest grammar-based encoding of the text. Secondly, the text is assumed to
be emitted by a finite-energy strongly nonergodic source whereas the facts are
binary IID variables predictable in a shift-invariant way.
Comment: 24 pages, no figures
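
As a purely arithmetic illustration of the informal statement (the exponent value below is hypothetical, not from the paper), one can tabulate the vocabulary lower bound n^β/log n against the number n^β of described facts:

    import math

    # beta (the repetition exponent) and the n values are hypothetical,
    # chosen only to show the scale of the bound n^beta / log n.
    beta = 0.7
    for n in (10**4, 10**6, 10**8):
        facts = n**beta
        words = n**beta / math.log(n)
        print(f"n={n:.0e}: ~{facts:,.0f} facts -> at least ~{words:,.0f} distinct words")
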
Universal Densities Exist for Every Finite Reference Measure
As is known, universal codes, which estimate the entropy rate
consistently, exist for stationary ergodic sources over finite alphabets but
not over countably infinite ones. We generalize universal coding as the problem
of universal densities with respect to a fixed reference measure on a countably
generated measurable space. We show that universal densities, which estimate
the differential entropy rate consistently, exist for finite reference
measures. Thus finite alphabets are not necessary in some sense. To exhibit a
universal density, we adapt the non-parametric differential (NPD) entropy rate
estimator by Feutrill and Roughan. Our modification is analogous to Ryabko's
modification of prediction by partial matching (PPM) by Cleary and Witten.
Whereas Ryabko considered a mixture over Markov orders, we consider a mixture
over quantization levels. Moreover, we demonstrate that any universal density
induces a strongly consistent Cesàro mean estimator of the conditional density
given an infinite past. This yields a universal predictor with the 0-1 loss
for a countable alphabet. Finally, we specialize universal densities to
processes over natural numbers and on the real line. We derive sufficient
conditions for consistent estimation of the entropy rate with respect to
infinite reference measures in these domains.
Comment: 28 pages, no figures
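
The following Python sketch shows one way a mixture over quantization levels might look; the function mixture_density, the dyadic levels, and the geometric prior are all assumptions for illustration, in the spirit of, but not identical to, the modified NPD estimator described above.

    import random

    def mixture_density(x, data, max_level=8):
        """Density estimate on [0, 1): a prior-weighted mixture of histogram
        densities at dyadic quantization levels k = 1..max_level, each
        smoothed with a Krichevsky-Trofimov-style add-1/2 rule."""
        n = len(data)
        dens, weight_sum = 0.0, 0.0
        for k in range(1, max_level + 1):
            m = 2 ** k                               # number of bins at level k
            counts = [0] * m
            for y in data:
                counts[min(int(y * m), m - 1)] += 1
            b = min(int(x * m), m - 1)               # bin containing x
            p_bin = (counts[b] + 0.5) / (n + m / 2)  # KT-smoothed bin probability
            w = 2.0 ** (-k)                          # geometric prior over levels
            dens += w * p_bin * m                    # density = bin probability / bin width
            weight_sum += w
        return dens / weight_sum                     # renormalize the truncated prior

    random.seed(0)
    sample = [random.betavariate(2, 5) for _ in range(1000)]
    print(mixture_density(0.25, sample))
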
Rate-Distortion via Markov Chain Monte Carlo
We propose an approach to lossy source coding, utilizing ideas from Gibbs
sampling, simulated annealing, and Markov Chain Monte Carlo (MCMC). The idea is
to sample a reconstruction sequence from a Boltzmann distribution associated
with an energy function that incorporates the distortion between the source and
reconstruction, the compressibility of the reconstruction, and the point sought
on the rate-distortion curve. To sample from this distribution, we use a
'heat bath algorithm': Starting from an initial candidate reconstruction (say the
original source sequence), at every iteration, an index i is chosen and the
i-th sequence component is replaced by drawing from the conditional probability
distribution for that component given all the rest. At the end of this process,
the encoder conveys the reconstruction to the decoder using universal lossless
compression. The complexity of each iteration is independent of the sequence
length and only linearly dependent on a certain context parameter (which grows
sub-logarithmically with the sequence length). We show that the proposed
algorithms achieve optimum rate-distortion performance in the limit of a large
number of iterations and sequence length, when employed on any stationary
ergodic source. Experimentation shows promising initial results. Employing our
lossy compressors on noisy data, with appropriately chosen distortion measure
and level, followed by a simple de-randomization operation, results in a family
of denoisers that compares favorably (both theoretically and in practice) with
other MCMC-based schemes, and with the Discrete Universal Denoiser (DUDE).
Comment: 35 pages, 16 figures, Submitted to IEEE Transactions on Information
Theory
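
A heavily simplified sketch of the described heat-bath sampler may help; the alphabet (binary), the distortion (Hamming), the compressibility term (order-1 empirical conditional entropy rather than the context-based code length used in the paper), and the annealing schedule are all assumed for illustration.

    import math, random

    def h1(y):
        """First-order empirical conditional entropy (bits/symbol) of a binary sequence."""
        n = len(y)
        counts = {}
        for a, b in zip(y, y[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        h = 0.0
        for a in (0, 1):
            tot = counts.get((a, 0), 0) + counts.get((a, 1), 0)
            for b in (0, 1):
                c = counts.get((a, b), 0)
                if c:
                    h -= c / (n - 1) * math.log2(c / tot)
        return h

    def energy(x, y, slope):
        """Energy = code-length proxy + slope * Hamming distortion."""
        d = sum(a != b for a, b in zip(x, y))
        return len(y) * h1(y) + slope * d

    def mcmc_compress(x, slope=4.0, sweeps=30, seed=0):
        """Heat-bath sweeps: resample one symbol at a time from the Boltzmann
        conditional given the rest, at a slowly decreasing temperature."""
        rng = random.Random(seed)
        y = list(x)
        n = len(y)
        for t in range(1, sweeps + 1):
            inv_temp = math.log(1 + t)               # annealing schedule
            for i in rng.sample(range(n), n):
                e = []
                for a in (0, 1):                     # energy of both candidate values
                    y[i] = a
                    e.append(energy(x, y, slope))
                p1 = 1.0 / (1.0 + math.exp(inv_temp * (e[1] - e[0])))
                y[i] = 1 if rng.random() < p1 else 0
        return y

    random.seed(1)
    x = [int(random.random() < 0.5) for _ in range(200)]
    y = mcmc_compress(x)
    print("distortion:", sum(a != b for a, b in zip(x, y)) / len(x),
          "H1:", round(h1(y), 3))
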
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf's law in
natural language which relies on two areas of information
theory. We reformulate the problem of grammar-based compression
and investigate properties of strongly nonergodic stationary
processes. The motivation for the joint discussion is to prove a
proposition with a simple informal statement: If an n-letter
long text describes n^β independent facts in a random but
consistent way, then the text contains at least n^β/log n
different words.
In the formal statement, two specific postulates are
adopted. Firstly, the words are understood as the nonterminal
symbols of the shortest grammar-based encoding of the
text. Secondly, the texts are assumed to be emitted by a
nonergodic source, with the described facts being binary IID
variables that are asymptotically predictable in a
shift-invariant way.
The proof of the formal proposition applies several new tools.
These are: a construction of universal grammar-based codes for
which the differences of code lengths can be bounded easily,
ergodic decomposition theorems for mutual information between the
past and future of a stationary process, and a lemma that bounds
differences of a sublinear function.
The linguistic relevance of the presented modeling assumptions,
theorems, definitions, and examples is discussed in
parallel. While searching for concrete processes to which our
proposition can be applied, we introduce several instances of
strongly nonergodic processes. In particular, we define the
subclass of accessible description processes, which formalizes
the notion of texts that describe facts in a self-contained way.