
    Two-Dimensional Source Coding by Means of Subblock Enumeration

    A technique of lossless compression via substring enumeration (CSE) attains compression ratios comparable to those of popular lossless compressors for one-dimensional (1D) sources. CSE encodes a source by means of a probabilistic model built from the circular string of the input. CSE is applicable to two-dimensional (2D) sources, such as images, by treating a line of pixels of the 2D source as a symbol of an extended alphabet. At the initial step of the CSE encoding process, the number of occurrences of every symbol of the extended alphabet must be output, so the time complexity increases exponentially as the source grows. To reduce the time complexity, we propose a new CSE that encodes a 2D source block by block instead of line by line. The proposed CSE uses the flat torus of the input 2D source as the probabilistic model for encoding, instead of the circular string of the source. Moreover, we analyze the limit of the average codeword length of the proposed CSE for general sources. Comment: 5 pages, Submitted to ISIT201
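
    As a minimal sketch of the flat-torus idea (illustrative only, not the authors' implementation; the function name count_subblocks and the toy 4x4 source are assumptions), the following Python snippet gathers the occurrence counts of every k-by-l subblock of a 2D source with wrap-around at the borders, the kind of statistics such a probabilistic model would be built from.

    from collections import Counter

    def count_subblocks(image, k, l):
        """Count every k-by-l subblock of a 2D source viewed as a flat torus.

        Rows and columns wrap around, so each of the H*W positions
        contributes exactly one subblock occurrence."""
        H, W = len(image), len(image[0])
        counts = Counter()
        for i in range(H):
            for j in range(W):
                block = tuple(
                    tuple(image[(i + di) % H][(j + dj) % W] for dj in range(l))
                    for di in range(k)
                )
                counts[block] += 1
        return counts

    # Toy 4x4 binary source and its 2x2 subblock statistics.
    image = [[0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0]]
    for block, n in sorted(count_subblocks(image, 2, 2).items()):
        print(block, n)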

    Compression by Substring Enumeration Using Sorted Contingency Tables

    This paper proposes two variants of improved Compression by Substring Enumeration (CSE) with a finite alphabet. In previous studies on CSE, the encoder uses inequalities that bound the number of occurrences of the substring or minimal forbidden word (MFW) to be encoded. The inequalities are derived from a contingency table containing the occurrence counts of substrings and MFWs. The codeword length of a substring or an MFW grows with the difference between the upper and lower bounds deduced from the inequalities; however, the lower bound is not tight. We therefore derive a new, tighter lower bound based on the contingency table and propose a new CSE algorithm that uses the resulting inequality. We also propose a new encoding order of substrings and MFWs based on a sorted contingency table, in which both the row and column marginal totals are sorted in descending order instead of the lexicographical order used in previous studies, and we propose the first CSE algorithm to use this encoding order. Experimental results show that the compression ratios of all files of the Calgary corpus under the proposed algorithms are better than those of a previous study on CSE with a finite alphabet. Moreover, the compression ratios under the second proposed CSE are better than or equal to those under a well-known compressor for 11 of the 14 files in the corpus.
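
    To illustrate how a contingency table constrains an occurrence count (a sketch under stated assumptions: the classical Fréchet bounds are shown as a stand-in, not the tighter lower bound derived in the paper, and all variable names are invented), consider a binary alphabet, where the occurrences N(awb) of a substring w extended by one symbol on each side form a 2x2 table with row marginals N(aw) and column marginals N(wb). The gap between the bounds is what drives the codeword length.

    def frechet_bounds(row_total, col_total, grand_total):
        """Classical bounds on one cell of a 2x2 contingency table,
        given its row marginal, column marginal, and grand total."""
        lower = max(0, row_total + col_total - grand_total)
        upper = min(row_total, col_total)
        return lower, upper

    # Occurrence counts around a substring w in a binary text:
    # rows index the symbol preceding w, columns the symbol following it.
    N_w  = 20          # N(w): total occurrences of w
    N_0w = 12          # row marginal N(0w)
    N_w1 = 9           # column marginal N(w1)

    lo, hi = frechet_bounds(N_0w, N_w1, N_w)
    print(f"N(0w1) lies between {lo} and {hi}")  # the narrower the range,
                                                 # the shorter the codeword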

    A Universal Two-Dimensional Source Coding by Means of Subblock Enumeration

    The technique of lossless compression via substring enumeration (CSE) is a kind of enumerative code that uses a probabilistic model built from the circular string of the input to encode a one-dimensional (1D) source. CSE is applicable to two-dimensional (2D) sources, such as images, by treating a line of pixels of the 2D source as a symbol of an extended alphabet. At the initial step of the CSE encoding process, the number of occurrences of every symbol of the extended alphabet must be output, so the time complexity increases exponentially as the source grows. To reduce computational time, the pixels of a 2D source can be rearranged into a 1D string along a space-filling curve such as a Hilbert curve; however, information about adjacent cells of the 2D source may be lost in the conversion. To reduce the time complexity and compress a 2D source without converting it to a 1D string, we propose a new CSE that encodes a 2D source block by block instead of line by line. The proposed algorithm uses the flat torus of the input 2D source as the probabilistic model instead of the circular string of the source. Moreover, we prove the asymptotic optimality of the proposed algorithm for 2D general sources.
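
    For reference, the space-filling-curve conversion mentioned above can be sketched as follows (a standard Hilbert-curve index computation, not code from the paper; hilbert_index and linearize are illustrative names). The proposed block-by-block CSE avoids this conversion and the loss of 2D adjacency information it can cause.

    def hilbert_index(n, x, y):
        """Position of cell (x, y) along the Hilbert curve filling an
        n-by-n grid, with n a power of two (standard iterative form)."""
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if (x & s) > 0 else 0
            ry = 1 if (y & s) > 0 else 0
            d += s * s * ((3 * rx) ^ ry)
            # Rotate/flip the quadrant so the pattern recurses correctly.
            if ry == 0:
                if rx == 1:
                    x = n - 1 - x
                    y = n - 1 - y
                x, y = y, x
            s //= 2
        return d

    def linearize(image):
        """Reorder the pixels of a 2^k x 2^k image into a 1D sequence
        along the Hilbert curve."""
        n = len(image)
        cells = sorted((hilbert_index(n, x, y), x, y)
                       for y in range(n) for x in range(n))
        return [image[y][x] for _, x, y in cells]

    image = [[0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0]]
    print(linearize(image))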

    A Space-Efficient Implementation of Compression by Substring Enumeration

    In lossless data compression, a compressed representation that takes as little space as possible relative to the original data is created from the given data, and it must be possible to recover an identical copy of the original data from it. This thesis studies a lossless compression method that examines the data to be compressed, i.e. a string or text, as a whole rather than, say, one small part at a time. The method transmits to the decompressor the numbers of occurrences of substrings in the text. The substrings are processed in a predetermined order from shortest to longest, so that both parties can associate each occurrence count with the correct substring. Some occurrence counts may be zero, indicating that the substring does not occur in the text. Compression is achieved by observing that the previously transmitted substrings constrain what longer substrings can look like; some occurrence counts can then be omitted, or transmitted using less space. The substrings whose occurrence counts must be transmitted are characterized using the notion of maximality. Finding the maximal substrings and computing substring occurrence counts from the plain text is slow, so the text must be stored in a data structure that supports the required operations and makes them fast. Such data structures take more space than the plain text, and because the studied compression method processes the entire text as a whole, memory efficiency is especially important. The thesis implements the compression method using a succinct data structure called the bidirectional BWT index. Succinct data structures take only slightly more space than the data stored in them, yet they still support efficient operations on that data. Experiments on the implementation show that memory usage remains moderate, which makes it possible to compress larger amounts of data as well.
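
    A minimal sketch of the counting task described above (illustrative only; the implementation in the thesis answers these queries with a bidirectional BWT index rather than a plain dictionary, and the function name is an assumption): occurrence counts of the substrings of the circular text, grouped from shortest to longest, which are the quantities the method reasons about.

    from collections import Counter

    def circular_substring_counts(text, max_len):
        """Naive baseline: occurrence counts of every substring of the
        circular text, grouped by length from shortest to longest."""
        n = len(text)
        counts_by_length = []
        for length in range(1, max_len + 1):
            counts = Counter(
                (text * 2)[i:i + length]  # doubling simulates wrap-around
                for i in range(n)
            )
            counts_by_length.append(counts)
        return counts_by_length

    for counts in circular_substring_counts("mississippi", 3):
        print(dict(counts))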

    Tree models: algorithms and information theoretic properties

    This thesis studies fundamental properties and algorithms related to tree models. These models require a relatively small number of parameters to represent finite-memory (Markov) sources over finite alphabets when the number of past symbols needed to determine the conditional probability distribution of the next symbol is not fixed, but depends on the context in which the symbol occurs. The thesis defines combinatorial structures such as generalized context trees and their FSM (finite state machine) closures, and applies these structures to describe the first linear-time implementation of encoding and decoding for the semi-predictive version of the Context algorithm, a twice-universal scheme that attains an optimal rate of convergence to the entropy over the class of tree models. The thesis then analyzes type classes for tree models, extending the method of types previously studied for FSM models. An exact formula is derived for the cardinality of the type class of a given sequence of length n, together with an asymptotic estimate of the expected logarithm of the size of a type class and an asymptotic estimate of the number of distinct type classes for sequences of a given length. These asymptotic results are derived with the help of the new concept of the minimal canonical extension of a context tree, a fundamental combinatorial object that lies between the original tree and its FSM closure. As applications of the newly discovered properties of tree models, twice-universal enumerative coding algorithms and universal simulation schemes for individual sequences are presented. Finally, the thesis presents some open problems and directions for future research in this area.
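
    To make the notion of a tree model concrete (a toy sketch, not the algorithms or data structures of the thesis; the leaf set and probabilities are invented), the conditional distribution of the next symbol is selected by the shortest suffix of the past that reaches a leaf of the context tree, so different contexts may have different lengths.

    # Leaves of a toy binary context tree and P(next = 1 | context).
    LEAVES = {
        "0":  0.2,   # last symbol was 0
        "01": 0.7,   # last symbol 1, preceded by 0
        "11": 0.4,   # last symbol 1, preceded by 1
    }
    MAX_DEPTH = 2

    def p_one(past):
        """Return P(next = 1 | past) by growing the suffix of `past`
        until it hits a leaf of the context tree."""
        for depth in range(1, MAX_DEPTH + 1):
            context = past[-depth:]
            if context in LEAVES:
                return LEAVES[context]
        raise ValueError("past too short to determine a context")

    print(p_one("0010"))  # context "0"  -> 0.2
    print(p_one("0001"))  # context "01" -> 0.7
    print(p_one("0011"))  # context "11" -> 0.4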

    Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering

    These are the online proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), which was held in the Trippenhuis, Amsterdam, in August 2012

    Optimal information storage : nonsequential sources and neural channels

    Thesis (S.M.) by Lav R. Varshney, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections (the MIT Institute Archives copy has pages 101-163 bound in reverse order). Includes bibliographical references (p. 141-163).
    Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theorem delineates the average fidelity and average resource values that are achievable and those that are not. Moreover, observable properties of optimal information storage systems, and the robustness of optimal systems to parameter mismatch, can be determined. In this thesis, we study the physical properties of a neural information storage channel and the fundamental bounds on the storage of sources that have nonsequential semantics. Experimental investigations have revealed that synapses in the mammalian brain possess unexpected properties. Adopting the optimization approach to biology, we cast the brain as an optimal information storage system and propose a theoretical framework that accounts for many of these physical properties. Based on previous experimental and theoretical work, we use volume as a limited resource and utilize the empirical relationship between volume and synaptic weight. Our scientific hypotheses are based on maximizing information storage capacity per unit cost. We use properties of the capacity-cost function, ε-capacity cost approximations, and measure matching to develop optimization principles. We find that capacity-achieving input distributions not only explain existing experimental measurements but also make non-trivial predictions about the physical structure of the brain. Numerous information storage applications have semantics such that the order of source elements is irrelevant, so the source sequence can be treated as a multiset. We formulate fidelity criteria that consider asymptotically large multisets and give conclusive, but trivialized, results in rate distortion theory. For fidelity criteria that consider fixed-size multisets, we give some conclusive results in high-rate quantization theory, low-rate quantization, and rate distortion theory. We also provide bounds on the rate-distortion function for other nonsequential fidelity criteria problems. System resource consumption can be significantly reduced by recognizing the correct invariance properties and semantics of the information storage task at hand.
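
    As a back-of-envelope illustration of the nonsequential-source idea (not the rate-distortion analysis of the thesis; the function name and example are invented), when the order of source elements is irrelevant, a description can drop the bits that only encode the ordering, roughly the base-2 logarithm of the number of distinct orderings of the multiset.

    from collections import Counter
    from math import comb, log2

    def order_bits_saved(sequence):
        """Bits that only encode the ordering of the sequence: log2 of the
        number of distinct orderings of its multiset (a multinomial)."""
        counts = Counter(sequence)
        orderings = 1
        remaining = len(sequence)
        for c in counts.values():
            orderings *= comb(remaining, c)
            remaining -= c
        return log2(orderings)

    seq = "ABRACADABRA"
    print(f"{order_bits_saved(seq):.1f} bits of ordering information "
          f"in a {len(seq)}-symbol sequence")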

    Sublinear Computation Paradigm

    This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data” that ran in Japan from October 2014 to March 2020. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need to develop novel methods and approaches for big data analysis, and innovative changes in algorithm theory for big data are being pursued to meet this need. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear, sublinear, and constant-time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams that focus on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, which consists of a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV, which review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; and Part V, which presents application results. The information presented here will inspire researchers who work in the field of modern algorithms.
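
    As a small generic example of the paradigm (not taken from the book; names and constants are illustrative), the following sampling estimator reads a number of items that depends only on the desired accuracy, not on the size of the data set, so its running time is sublinear, in fact constant, in the input size when random access is available.

    import random
    from math import ceil, log

    def estimate_fraction(data, predicate, eps=0.05, delta=0.01, rng=random):
        """Estimate the fraction of items satisfying `predicate` to within
        +/- eps with probability at least 1 - delta, using a sample size
        that is independent of len(data) (Hoeffding bound)."""
        m = ceil(log(2 / delta) / (2 * eps ** 2))
        hits = sum(predicate(data[rng.randrange(len(data))]) for _ in range(m))
        return hits / m

    data = list(range(10_000_000))
    print(estimate_fraction(data, lambda v: v % 3 == 0))  # roughly 0.333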

    Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-2012)

    Peer reviewed