7 research outputs found

    On vocabulary size of grammar-based codes

    We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities which were discussed in a weaker form in linguistics but shed some light on redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily. Comment: 5 pages, accepted to ISIT 2007 and correcte
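    To make "vocabulary size" concrete, the following minimal sketch builds a grammar for a string and counts its distinct nonterminals. It uses a greedy Re-Pair-style pair replacement, which is a swapped-in illustration and not the universal grammar-based code constructed in the paper; the names `repair_grammar`, `expand`, and `vocabulary_size` are hypothetical.

```python
from collections import Counter


def repair_grammar(text):
    """Greedy Re-Pair-style grammar (illustrative assumption, not the paper's code).

    Repeatedly replaces the most frequent adjacent pair with a fresh nonterminal
    until no adjacent pair is counted at least twice.
    Returns (start_sequence, rules), where rules maps nonterminal -> (left, right).
    """
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nonterminal = ("N", next_id)  # tuples cannot collide with character terminals
        next_id += 1
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            # Left-to-right, non-overlapping replacement of the chosen pair.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a terminal or nonterminal back into the string it derives."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol


def vocabulary_size(rules):
    """Vocabulary size: the number of distinct nonterminal symbols."""
    return len(rules)


if __name__ == "__main__":
    start, rules = repair_grammar("abababab")
    print(vocabulary_size(rules), "".join(expand(s, rules) for s in start))
```

    A different grammar transform would in general yield a different vocabulary size for the same string, so the count here only illustrates the quantity being bounded.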

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If a text of length $n$ describes $n^\beta$ independent facts in a repetitive way then the text contains at least $n^\beta/\log n$ different words, under suitable conditions on $n$. In the formal statement, two modeling postulates are adopted. Firstly, the words are understood as nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the text is assumed to be emitted by a finite-energy strongly nonergodic source whereas the facts are binary IID variables predictable in a shift-invariant way. Comment: 24 pages, no figure

    Universal Densities Exist for Every Finite Reference Measure

    As is known, universal codes, which estimate the entropy rate consistently, exist for stationary ergodic sources over finite alphabets but not over countably infinite ones. We generalize universal coding as the problem of universal densities with respect to a fixed reference measure on a countably generated measurable space. We show that universal densities, which estimate the differential entropy rate consistently, exist for finite reference measures. Thus finite alphabets are not necessary in some sense. To exhibit a universal density, we adapt the non-parametric differential (NPD) entropy rate estimator by Feutrill and Roughan. Our modification is analogous to Ryabko's modification of prediction by partial matching (PPM) by Cleary and Witten. Whereas Ryabko considered a mixture over Markov orders, we consider a mixture over quantization levels. Moreover, we demonstrate that any universal density induces a strongly consistent Cesàro mean estimator of conditional density given an infinite past. This yields a universal predictor with the 0-1 loss for a countable alphabet. Finally, we specialize universal densities to processes over natural numbers and on the real line. We derive sufficient conditions for consistent estimation of the entropy rate with respect to infinite reference measures in these domains. Comment: 28 pages, no figure

    Rate-Distortion via Markov Chain Monte Carlo

    We propose an approach to lossy source coding, utilizing ideas from Gibbs sampling, simulated annealing, and Markov Chain Monte Carlo (MCMC). The idea is to sample a reconstruction sequence from a Boltzmann distribution associated with an energy function that incorporates the distortion between the source and reconstruction, the compressibility of the reconstruction, and the point sought on the rate-distortion curve. To sample from this distribution, we use a 'heat bath algorithm': starting from an initial candidate reconstruction (say the original source sequence), at every iteration, an index i is chosen and the i-th sequence component is replaced by drawing from the conditional probability distribution for that component given all the rest. At the end of this process, the encoder conveys the reconstruction to the decoder using universal lossless compression. The complexity of each iteration is independent of the sequence length and only linearly dependent on a certain context parameter (which grows sub-logarithmically with the sequence length). We show that the proposed algorithms achieve optimum rate-distortion performance in the limit of a large number of iterations and a large sequence length, when employed on any stationary ergodic source. Experimentation shows promising initial results. Employing our lossy compressors on noisy data, with appropriately chosen distortion measure and level, followed by a simple de-randomization operation, results in a family of denoisers that compares favorably (both theoretically and in practice) with other MCMC-based schemes, and with the Discrete Universal Denoiser (DUDE). Comment: 35 pages, 16 figures, Submitted to IEEE Transactions on Information Theor
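    The heat-bath step described in this abstract can be sketched for a binary sequence as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the compressibility term is taken to be an order-`k` empirical conditional entropy, the distortion is Hamming distance, the inverse temperature `beta` is held fixed (no annealing schedule), and the energy is recomputed from scratch at every step, so the per-iteration cost here does grow with the sequence length. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def cond_entropy_bits(y, k):
    """Total order-k empirical conditional entropy of y in bits (cyclic contexts).

    Used as a stand-in for the compressibility of the reconstruction (an assumption).
    """
    n = len(y)
    ctx_counts = Counter()
    joint_counts = Counter()
    for i in range(n):
        ctx = tuple(y[(i - j) % n] for j in range(1, k + 1))
        ctx_counts[ctx] += 1
        joint_counts[(ctx, y[i])] += 1
    return -sum(
        c * math.log2(c / ctx_counts[ctx]) for (ctx, _), c in joint_counts.items()
    )


def energy(x, y, k, slope):
    """Energy = compressibility of y + slope * Hamming distortion between x and y."""
    distortion = sum(a != b for a, b in zip(x, y))
    return cond_entropy_bits(y, k) + slope * distortion


def heat_bath(x, k=2, slope=3.0, beta=2.0, sweeps=5, seed=0):
    """Resample one randomly chosen component at a time from its conditional law."""
    rng = random.Random(seed)
    y = list(x)  # start from the source sequence itself
    n = len(y)
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        energies = []
        for b in (0, 1):
            y[i] = b
            energies.append(energy(x, y, k, slope))
        # Boltzmann conditional: P(y_i = 1 | rest) = 1 / (1 + exp(beta * (E1 - E0))).
        z = beta * (energies[1] - energies[0])
        p1 = 1.0 / (1.0 + math.exp(z)) if z < 700 else 0.0
        y[i] = 1 if rng.random() < p1 else 0
    return y


if __name__ == "__main__":
    source_rng = random.Random(1)
    x = [1 if source_rng.random() < 0.2 else 0 for _ in range(64)]
    y = heat_bath(x)
    print(sum(a != b for a, b in zip(x, y)), cond_entropy_bits(y, 2))
```

    The `slope` parameter plays the role of the point sought on the rate-distortion curve: larger values penalize distortion more heavily and keep the reconstruction closer to the source.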

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    The article presents a new interpretation for Zipf's law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an $n$-letter long text describes $n^\beta$ independent facts in a random but consistent way then the text contains at least $n^\beta/\log n$ different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools. These are: a construction of universal grammar-based codes for which the differences of code lengths can be bounded easily, ergodic decomposition theorems for mutual information between the past and future of a stationary process, and a lemma that bounds differences of a sublinear function. The linguistic relevance of presented modeling assumptions, theorems, definitions, and examples is discussed in parallel. While searching for concrete processes to which our proposition can be applied, we introduce several instances of strongly nonergodic processes. In particular, we define the subclass of accessible description processes, which formalizes the notion of texts that describe facts in a self-contained way