7 research outputs found

    Context-based compression algorithms for text and image data.

    Wong Ling. Thesis (M.Phil.), Chinese University of Hong Kong, 1997. Includes bibliographical references (leaves 80-85).

    Contents:
    ABSTRACT
    1. INTRODUCTION
        1.1 Motivation
        1.2 Original Contributions
        1.3 Thesis Structure
    2. BACKGROUND
        2.1 Information Theory
        2.2 Early Compression
            2.2.1 Some Source Codes
                2.2.1.1 Huffman Code
                2.2.1.2 Tunstall Code
                2.2.1.3 Arithmetic Code
        2.3 Modern Techniques for Compression
            2.3.1 Statistical Modeling
                2.3.1.1 Context Modeling
                2.3.1.2 State Based Modeling
            2.3.2 Dictionary Based Compression
                2.3.2.1 LZ-compression
            2.3.3 Other Compression Techniques
                2.3.3.1 Block Sorting
                2.3.3.2 Context Tree Weighting
    3. SYMBOL REMAPPING
        3.1 Reviews on Block Sorting
            3.1.1 Forward Transformation
            3.1.2 Inverse Transformation
        3.2 Ordering Method
        3.3 Discussions
    4. CONTENT PREDICTION
        4.1 Prediction and Ranking Schemes
            4.1.1 Content Predictor
            4.1.2 Ranking Technique
        4.2 Reviews on Context Sorting
            4.2.1 Context Sorting Basis
        4.3 General Framework of Content Prediction
            4.3.1 A Baseline Version
            4.3.2 Context Length Merge
        4.4 Discussions
    5. BOUNDED-LENGTH BLOCK SORTING
        5.1 Block Sorting with Bounded Context Length
            5.1.1 Forward Transformation
            5.1.2 Reverse Transformation
        5.2 Locally Adaptive Entropy Coding
        5.3 Discussion
    6. CONTEXT CODING FOR IMAGE DATA
        6.1 Digital Images
            6.1.1 Redundancy
        6.2 Model of a Compression System
            6.2.1 Representation
            6.2.2 Quantization
            6.2.3 Lossless Coding
        6.3 The Embedded Zerotree Wavelet Coding
            6.3.1 Simple Zerotree-like Implementation
            6.3.2 Analysis of Zerotree Coding
                6.3.2.1 Linkage between Coefficients
                6.3.2.2 Design of Uniform Threshold Quantizer with Dead Zone
        6.4 Extensions on Wavelet Coding
            6.4.1 Coefficients Scanning
        6.5 Discussions
    7. CONCLUSIONS
        7.1 Future Research
    APPENDIX
        A Lossless Compression Results
        B Image Compression Standards
        C Human Visual System Characteristics
        D Lossy Compression Results
    COMPRESSION GALLERY
        Context-based Wavelet Coding
        RD-OPT-based JPEG Compression
        SPIHT Wavelet Compression
    REFERENCES
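    The block-sorting transform that Chapters 3 and 5 of the thesis above review, with its forward and inverse steps, is generally known as the Burrows-Wheeler transform. A minimal textbook sketch of that transform pair follows. It assumes a sentinel character `$` that does not occur in the input and uses a naive rotation sort; the function names are illustrative and nothing here is taken from the thesis itself.

```python
def bwt_forward(text, sentinel="$"):
    """Naive block-sorting (Burrows-Wheeler) forward transform.

    Appends a sentinel, sorts all rotations of the string, and returns
    the last column of the sorted rotation table.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_inverse(last_column, sentinel="$"):
    """Naive inverse transform.

    Rebuilds the sorted rotation table one column at a time by
    prepending the last column to the partial rows and re-sorting.
    """
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt_forward("banana")
    print(encoded)
    print(bwt_inverse(encoded))
```

    The sketch favours clarity over speed: sorting full rotations needs quadratic space, whereas practical implementations typically sort suffixes instead.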

    Statistical and repetition-based compressed data structures

    [Abstract] In this thesis we present several practical compressed data structures that address open problems related to statistically compressible and highly repetitive databases. In the first part, we focus on statistical compressed data structures, targeting the problem of managing large alphabets. This problem arises when typical sequence-based compression is used as a basis for compressed data structures representing more general structures such as grids and graphs. Concretely, (a) we provide space-efficient solutions to represent prefix-free codes when the alphabet is large; (b) we also present a new wavelet-tree-based data structure to solve rank and select queries that obtains zero-order compression and outperforms previous wavelet tree implementations on large alphabets. In the second part of this thesis, we focus on highly repetitive datasets. We present (c) a very space-efficient grammar-based compressed data structure to solve rank and select in these scenarios; (d) the first LZ77-space-bounded compressed data structure that solves rank and select queries in O(1) time and is in practice almost as fast as statistically compressed structures; and (e) the first practical version of grammar-compressed tree topologies, obtaining unprecedented results in the representation of repetitive trees. Additionally, we apply our new solutions to several problems of interest: point grids, inverted indexes, self-indexes, XPath systems, and compressed suffix trees of highly repetitive inputs, displaying various space-time tradeoffs of interest.
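    The rank and select queries named in the abstract above can be pinned down with an uncompressed reference version. The sketch below assumes 0-based positions, a half-open prefix for rank, and a 1-based occurrence index for select. The class name and these conventions are illustrative choices, and the structure stores plain position lists rather than any of the compressed representations the thesis proposes.

```python
from bisect import bisect_left
from collections import defaultdict


class NaiveRankSelect:
    """Uncompressed reference structure for rank/select on a sequence."""

    def __init__(self, sequence):
        # For every symbol, keep the sorted list of positions where it occurs.
        self._positions = defaultdict(list)
        for i, symbol in enumerate(sequence):
            self._positions[symbol].append(i)
        self._length = len(sequence)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` in sequence[0:i]."""
        if not 0 <= i <= self._length:
            raise IndexError("prefix length out of range")
        # Binary search counts the stored positions that are smaller than i.
        return bisect_left(self._positions.get(symbol, []), i)

    def select(self, symbol, j):
        """0-based position of the j-th (1-based) occurrence of `symbol`."""
        occurrences = self._positions.get(symbol, [])
        if not 1 <= j <= len(occurrences):
            raise ValueError("no such occurrence")
        return occurrences[j - 1]


if __name__ == "__main__":
    rs = NaiveRankSelect("abracadabra")
    print(rs.rank("a", 4))
    print(rs.select("a", 3))
```

    Under these conventions the two queries are inverses in the sense that `rank(c, select(c, j)) == j - 1` for every valid `j`. This reference version answers select in constant time and rank in logarithmic time, at the cost of storing every position explicitly.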

    Content-aware compression for big textual data analysis

    A substantial amount of information on the Internet takes the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC distinguishes between informational and functional content and compresses only the informational content. Thus, the compressed data is made transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.

    28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)

    The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008). ..