
    Prefix Codes for Power Laws with Countable Support

    In prefix coding over an infinite alphabet, methods that target specific distributions generally consider those that decline more quickly than a power law (e.g., Golomb coding). Particular power-law distributions, however, model many random variables encountered in practice. For such random variables, compression performance is judged via estimates of expected bits per input symbol. This correspondence introduces a family of prefix codes with an eye towards near-optimal coding of known distributions. Compression performance is precisely estimated for well-known probability distributions using these codes and using previously known prefix codes. One application of these near-optimal codes is an improved representation of rational numbers.
    Comment: 5 pages, 2 tables, submitted to Transactions on Information Theory
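
    The paper's own code family is not reproduced in this abstract, but a minimal sketch of the classic Elias gamma code (shown for illustration; it is not the construction the paper introduces) conveys what near-optimal prefix coding of a power law looks like: the codeword for n has length 2⌊log2 n⌋ + 1, which tracks the ideal length -log2 P(n) when P(n) is proportional to 1/n².

    ```python
    def elias_gamma(n: int) -> str:
        """Elias gamma code: floor(log2 n) zeros, then the binary
        expansion of n. Near-optimal when P(n) ~ 1/n^2, since the
        length 2*floor(log2 n) + 1 tracks -log2 P(n)."""
        if n < 1:
            raise ValueError("Elias gamma codes positive integers only")
        binary = bin(n)[2:]                    # drop the '0b' prefix
        return "0" * (len(binary) - 1) + binary

    # A prefix code over a countable alphabet: no codeword is a
    # prefix of another, so a bit stream decodes unambiguously.
    print([elias_gamma(n) for n in range(1, 6)])
    # ['1', '010', '011', '00100', '00101']
    ```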

    Zipf's law and L. Levin's probability distributions

    Zipf's law in its basic incarnation is an empirical probability distribution governing the frequency of usage of words in a language. As Terence Tao recently remarked, it still lacks a convincing and satisfactory mathematical explanation. In this paper I suggest that, at least in certain situations, Zipf's law can be explained as a special case of the a priori distribution introduced and studied by L. Levin. The Zipf ranking corresponding to diminishing probability then appears as the ordering determined by growing Kolmogorov complexity. One argument justifying this assertion is the appeal to a recent interpretation by Yu. Manin and M. Marcolli of asymptotic bounds for error-correcting codes in terms of phase transition. In the respective partition function, the Kolmogorov complexity of a code plays the role of its energy. This version contains minor corrections and additions.
    Comment: 19 pages
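
    A compressed version of the heuristic (a sketch built from standard coding-theoretic facts, not the paper's full argument): Levin's a priori probability of an object x is

    ```latex
    m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|} \;\asymp\; 2^{-K(x)},
    ```

    where U is a universal prefix machine and K is prefix Kolmogorov complexity. Since fewer than 2^k objects satisfy K(x) < k, ranking objects by increasing complexity forces K(x_r) ≥ log2 r for the object of rank r, hence m(x_r) ≲ 1/r, which is exactly a Zipf-type decay of probability with rank.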

    Variable-Length Coding of Two-Sided Asymptotically Mean Stationary Measures

    We collect several observations that concern variable-length coding of two-sided infinite sequences in a probabilistic setting. Attention is paid to images and preimages of asymptotically mean stationary measures defined on subsets of these sequences. We point out sufficient conditions under which the variable-length coding and its inverse preserve asymptotic mean stationarity. Moreover, conditions for preservation of shift-invariant σ-fields and the finite-energy property are discussed, and the block entropies for stationary means of coded processes are related in some cases. Subsequently, we apply certain of these results to construct a stationary nonergodic process with a desired linguistic interpretation.
    Comment: 20 pages. A few typos corrected after the journal publication
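
    For orientation, the standard definition being invoked (after Gray and Kieffer; included here as background, not as a result of the paper): a measure μ on two-sided sequences is asymptotically mean stationary with respect to the shift T if the limit

    ```latex
    \bar{\mu}(A) \;=\; \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mu\left(T^{-k} A\right)
    ```

    exists for every measurable A. The limit μ̄ is then a stationary measure, the stationary mean, and it is the block entropies of such stationary means of coded processes that the paper relates.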

    Shannon Information and Kolmogorov Complexity

    We compare the elementary theories of Shannon information and Kolmogorov complexity, the extent to which they have a common purpose, and where they are fundamentally different. We discuss and relate the basic notions of both theories: Shannon entropy versus Kolmogorov complexity, the relation of both to universal coding, Shannon mutual information versus Kolmogorov ('algorithmic') mutual information, probabilistic sufficient statistic versus algorithmic sufficient statistic (related to lossy compression in the Shannon theory versus meaningful information in the Kolmogorov theory), and rate distortion theory versus Kolmogorov's structure function. Part of the material has appeared in print before, scattered through various publications, but this is the first comprehensive systematic comparison. The last-mentioned relations are new.
    Comment: Survey, LaTeX, 54 pages, 3 figures, submitted to IEEE Trans. Information Theory
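
    One bridge mentioned above can be stated concretely: for a computable distribution P, the expected prefix complexity equals the Shannon entropy up to an additive constant depending on P. A minimal sketch of the Shannon side (the function name is ours):

    ```python
    import math

    def shannon_entropy(probs):
        """Shannon entropy H(X) = -sum_x P(x) log2 P(x), in bits.
        For computable P, sum_x P(x) K(x) = H(X) + O(1), where the
        constant depends on (the complexity of) P itself."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
    ```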

    Large-alphabet sequence modelling - a comparative study

    Most raw data is not binary but is drawn from an alphabet that is often large and structured. It is sometimes convenient to work with a binarised data sequence, but in many practical applications exploiting the original structure of the data significantly improves performance. In this thesis, we study Martin-Löf random sequences, which are maximally incompressible, and provide a topological view on the size of the set of random sequences. We also investigate the relationship between binary data compression techniques and modelling natural language text, with the latter using the raw, unbinarised data sequence over a large alphabet. We perform an experimental comparative study of these methods, including an empirical comparison of Kneser-Ney (KN) variants with the regular Context Tree Weighting (CTW) algorithm, phase CTW, and large-alphabet CTW with different estimators. We also apply the idea of Hutter's adaptive sparse Dirichlet-multinomial coding to the KN method and provide a heuristic that makes the discounting parameter adaptive. KN with this adaptive discounting parameter outperforms the traditional KN method on the Large Calgary corpus.
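
    As a concrete reference point, the per-context estimator in regular CTW is usually the Krichevsky-Trofimov (KT) estimator, whose redundancy grows with the alphabet size k; that growth is what motivates the sparse, adaptive estimators the thesis compares. A minimal sketch (variable names are ours):

    ```python
    def kt_probability(counts, symbol):
        """Krichevsky-Trofimov estimator: add 1/2 to every count.
        Over a k-ary alphabet the denominator grows by k/2, which
        is costly when k is large and most symbols never occur."""
        k = len(counts)
        total = sum(counts.values())
        return (counts[symbol] + 0.5) / (total + k / 2)

    counts = {"a": 3, "b": 1, "c": 0}
    print(kt_probability(counts, "a"))  # 3.5 / 5.5 ≈ 0.636
    ```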

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If a text of length n describes n^β independent facts in a repetitive way, then the text contains at least n^β / log n different words, under suitable conditions on n. In the formal statement, two modeling postulates are adopted. Firstly, the words are understood as nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the text is assumed to be emitted by a finite-energy strongly nonergodic source, whereas the facts are binary IID variables predictable in a shift-invariant way.
    Comment: 24 pages, no figures
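
    The paper works with the shortest grammar-based encoding, which is NP-hard to compute; a toy Re-Pair-style heuristic (our illustration, not the paper's construction) shows what "words as nonterminal symbols of a grammar-based encoding" means in practice:

    ```python
    from collections import Counter

    def repair_grammar(text):
        """Toy Re-Pair-style transform: repeatedly replace the most
        frequent adjacent pair of symbols with a fresh nonterminal.
        The nonterminals play the role the paper assigns to words."""
        seq, rules, next_id = list(text), {}, 0
        while len(seq) > 1:
            pair, freq = Counter(zip(seq, seq[1:])).most_common(1)[0]
            if freq < 2:
                break
            nt = "N%d" % next_id
            next_id += 1
            rules[nt] = pair
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    print(repair_grammar("abababab"))
    # (['N1', 'N1'], {'N0': ('a', 'b'), 'N1': ('N0', 'N0')})
    ```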

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    The article presents a new interpretation for Zipf's law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an n-letter long text describes n^β independent facts in a random but consistent way, then the text contains at least n^β / log n different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools: a construction of universal grammar-based codes for which the differences of code lengths can be bounded easily, ergodic decomposition theorems for mutual information between the past and future of a stationary process, and a lemma that bounds the differences of a sublinear function. The linguistic relevance of the presented modeling assumptions, theorems, definitions, and examples is discussed in parallel. While searching for concrete processes to which our proposition can be applied, we introduce several instances of strongly nonergodic processes. In particular, we define the subclass of accessible description processes, which formalizes the notion of texts that describe facts in a self-contained way.

    Fair Testing

    In this paper we present a solution to the long-standing problem of characterising the coarsest liveness-preserving pre-congruence with respect to a full (TCSP-inspired) process algebra. In fact, we present two distinct characterisations, which give rise to the same relation: an operational one based on a De Nicola-Hennessy-like testing modality which we call should-testing, and a denotational one based on a refined notion of failures. One of the distinguishing characteristics of the should-testing pre-congruence is that it abstracts from divergences in the same way as Milner's observation congruence, and as a consequence is strictly coarser than observation congruence. In other words, should-testing has a built-in fairness assumption. This is in itself a property long sought after; it is in notable contrast to the well-known must-testing of De Nicola and Hennessy (denotationally characterised by a combination of failures and divergences), which treats divergence as catastrophic and hence is incompatible with observation congruence. Due to these characteristics, should-testing supports modular reasoning and allows one to use the proof techniques of observation congruence, but also supports additional laws and techniques. Moreover, we show decidability of should-testing (on the basis of the denotational characterisation). Finally, we demonstrate its advantages by application to a number of examples, including a scheduling problem, a version of the Alternating Bit protocol, and fair lossy communication channels.
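
    A minimal sketch of the operational reading of should-testing as we understand it (an illustration under that assumption, not the paper's formalisation): a composed process-test system should-passes iff from every reachable state a success state remains reachable, so a divergent loop is harmless as long as escape to success stays possible.

    ```python
    def should_pass(transitions, start, success):
        """Should/fair testing on a finite transition system:
        pass iff every reachable state can still reach success.
        Contrast must-testing, which demands that every maximal
        run succeeds and hence rejects any divergent loop."""
        def reachable(frm):
            seen, stack = set(), [frm]
            while stack:
                s = stack.pop()
                if s not in seen:
                    seen.add(s)
                    stack.extend(transitions.get(s, []))
            return seen
        return all(success & reachable(s) for s in reachable(start))

    # A process that may loop forever but can always still succeed:
    # rejected by must-testing, accepted by should-testing.
    lts = {"s0": ["s0", "ok"]}
    print(should_pass(lts, "s0", {"ok"}))  # True
    ```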