7 research outputs found

    Information Theory over Multisets

    Starting from Shannon's theory of information, this paper presents the case of producing information in the form of multisets and of encoding information using multisets. We review the entropy rate of a multiset information source and derive a formula for the information content of a multiset. We then study the encoder and channel parts of the system, obtaining results on multiset encoding length and channel capacity.
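    As a concrete illustration of the gap between sequence and multiset information content, the following sketch computes both entropies exactly by enumeration for a small i.i.d. source. The alphabet, probabilities, and block length are arbitrary choices for illustration; this is not the paper's closed-form formula.

    import itertools, math
    from collections import defaultdict

    p = {'a': 0.5, 'b': 0.3, 'c': 0.2}   # assumed toy source distribution
    n = 6                                 # block length

    # Entropy of the ordered length-n sequence: n * H(X) for an i.i.d. source.
    seq_entropy = n * -sum(q * math.log2(q) for q in p.values())

    # Induced distribution over multisets: group sequences by sorted content.
    multiset_prob = defaultdict(float)
    for seq in itertools.product(p, repeat=n):
        multiset_prob[tuple(sorted(seq))] += math.prod(p[s] for s in seq)

    ms_entropy = -sum(q * math.log2(q) for q in multiset_prob.values())
    print(f"H(sequence) = {seq_entropy:.3f} bits; H(multiset) = {ms_entropy:.3f} bits")
    print(f"bits spent on ordering: {seq_entropy - ms_entropy:.3f}")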

    Compressing Sets and Multisets of Sequences

    This is the accepted manuscript of a paper published in IEEE Transactions on Information Theory, vol. 61, no. 3, March 2015, doi: 10.1109/TIT.2015.2392093. © 2015 IEEE.

    This paper describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset's unordered structure. Multisets are a generalization of sets in which members are allowed to occur multiple times. A multiset can be encoded naïvely by simply storing its elements in some sequential order, but information is then wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums and multisets of arbitrary, individually encodable objects.

    This work was supported in part by the Engineering and Physical Sciences Research Council under Grant EP/I036575 and in part by a Google Research Award. This paper was presented at the 2014 Data Compression Conference.
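    The tree construction itself is specific to the paper, but the size of the prize is easy to bound. A minimal sketch, assuming n distinct b-bit hash sums stored naïvely versus order-invariantly; the log2(n!) term is the information any sequential encoding wastes on ordering:

    import math

    def multiset_bound(n: int, b: int = 160) -> tuple[float, float]:
        """Naive cost and order-free lower bound, in bits, for n distinct b-bit hashes."""
        naive = n * b                                # store the hashes in some order
        ordering = math.lgamma(n + 1) / math.log(2)  # log2(n!) spent on that order
        return naive, naive - ordering

    for n in (10, 1_000, 1_000_000):
        naive, ideal = multiset_bound(n)
        print(f"n={n:>9}: naive {naive:,.0f} bits, bound {ideal:,.0f} bits "
              f"({100 * (naive - ideal) / naive:.1f}% saved)")

    For SHA-1 sums (b = 160), the saving is roughly log2(n) − 1.44 bits per element, so it grows with the size of the collection.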

    Benefiting from disorder: source coding for unordered data

    The order of letters is not always relevant in a communication task. This paper discusses the implications of order irrelevance for source coding, presenting results in several major branches of source coding theory: lossless coding, universal lossless coding, rate-distortion, high-rate quantization, and universal lossy coding. The main conclusion is that there are significant rate savings when order is irrelevant. In particular, lossless coding of n letters from a finite alphabet requires Θ(log n) bits, and universal lossless coding requires n + o(n) bits for many countable-alphabet sources. However, there are no universal schemes that can drive a strong redundancy measure to zero. Results for lossy coding include distribution-free expressions for the rate savings from order irrelevance in various high-rate quantization schemes. Rate-distortion bounds are given, and it is shown that the analogue of the Shannon lower bound is loose at all finite rates.

    First author draft.
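    The Θ(log n) figure for lossless coding follows from a standard type-counting step (sketched here under the finite-alphabet assumption, not quoted from the paper): a multiset of n letters from an alphabet of size k is fully described by its histogram, and

    \[
    \#\{\text{multisets}\} = \binom{n+k-1}{k-1} = \Theta\!\left(n^{k-1}\right)
    \quad\Longrightarrow\quad
    \log_2 \binom{n+k-1}{k-1} = (k-1)\log_2 n + O(1) = \Theta(\log n).
    \]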

    Optimal information storage: nonsequential sources and neural channels

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. MIT Institute Archives copy: pages 101-163 bound in reverse order. Includes bibliographical references (p. 141-163).

    Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theorem delineates which average fidelity and average resource values are achievable and which are not. Moreover, observable properties of optimal information storage systems and the robustness of optimal systems to parameter mismatch may be determined. In this thesis, we study the physical properties of a neural information storage channel and the fundamental bounds on the storage of sources that have nonsequential semantics.

    Experimental investigations have revealed that synapses in the mammalian brain possess unexpected properties. Adopting the optimization approach to biology, we cast the brain as an optimal information storage system and propose a theoretical framework that accounts for many of these physical properties. Based on previous experimental and theoretical work, we use volume as a limited resource and utilize the empirical relationship between volume and synaptic weight. Our scientific hypotheses are based on maximizing information storage capacity per unit cost. We use properties of the capacity-cost function, ε-capacity-cost approximations, and measure matching to develop optimization principles. We find that capacity-achieving input distributions not only explain existing experimental measurements but also make non-trivial predictions about the physical structure of the brain.

    Numerous information storage applications have semantics such that the order of source elements is irrelevant, so the source sequence can be treated as a multiset. We formulate fidelity criteria that consider asymptotically large multisets and give conclusive, but trivialized, results in rate-distortion theory. For fidelity criteria that consider fixed-size multisets, we give some conclusive results in high-rate quantization theory, low-rate quantization, and rate-distortion theory. We also provide bounds on the rate-distortion function for other nonsequential fidelity criteria. System resource consumption can be significantly reduced by recognizing the correct invariance properties and semantics of the information storage task at hand.

    by Lav R. Varshney. S.M.
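    For context, the capacity-per-unit-cost objective invoked above has a standard formulation (Verdú's definition, stated here as background rather than quoted from the thesis): with b(x) the resource cost of input x, e.g. synaptic volume,

    \[
    C(B) = \max_{p_X \,:\, \mathbb{E}[b(X)] \le B} I(X;Y),
    \qquad
    \hat{C} = \sup_{B > 0} \frac{C(B)}{B},
    \]

    and the capacity-achieving input distributions referred to above are the maximizers of this program.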

    Toward a Source Coding Theory for Sets

    The problem of communicating (unordered) sets, rather than (ordered) sequences, is formulated. Elementary results in all major branches of source coding theory, including lossless coding, high-rate and low-rate quantization, and rate-distortion theory, are presented. In certain scenarios, rate savings of log n! bits are obtained for sets of size n. Asymptotically in the set size, the entropy rate is zero, and for sources with an ordered parent alphabet, the rate-distortion function collapses to the (0, 0) point.
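    One standard way to see the log n! figure (a textbook identity about order statistics, not quoted from the abstract): if X_1, ..., X_n are i.i.d. continuous, hence almost surely distinct, discarding their order costs exactly the entropy of a uniformly random permutation,

    \[
    h(X_1,\dots,X_n) - h(\{X_1,\dots,X_n\}) = \log_2 n! \ \text{bits},
    \]

    which is the rate saving a set encoder can reclaim.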