
    Text compression for Chinese documents.

    by Chi-kwun Kan. Thesis (M.Phil.), Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 133-137).

    Contents:
    Abstract
    Acknowledgement
    Chapter 1  Introduction
        1.1  Importance of Text Compression
        1.2  Historical Background of Data Compression
        1.3  The Essences of Data Compression
        1.4  Motivation and Objectives of the Project
        1.5  Definition of Important Terms
            1.5.1  Data Models
            1.5.2  Entropy
            1.5.3  Statistical and Dictionary-based Compression
            1.5.4  Static and Adaptive Modelling
            1.5.5  One-Pass and Two-Pass Modelling
        1.6  Benchmarks and Measurements of Results
        1.7  Sources of Testing Data
        1.8  Outline of the Thesis
    Chapter 2  Literature Survey
        2.1  Data Compression Algorithms
            2.1.1  Statistical Compression Methods
            2.1.2  Dictionary-based Compression Methods (Ziv-Lempel Family)
        2.2  Cascading of Algorithms
        2.3  Problems of Current Compression Programs on Chinese
        2.4  Previous Chinese Data Compression Literature
    Chapter 3  Chinese-related Issues
        3.1  Characteristics in Chinese Data Compression
            3.1.1  Large and Non-fixed-size Character Set
            3.1.2  Lack of Word Segmentation
            3.1.3  Rich Semantic Meaning of Chinese Characters
            3.1.4  Grammatical Variance of the Chinese Language
        3.2  Definition of Different Coding Schemes
            3.2.1  Big5 Code
            3.2.2  GB (Guo Biao) Code
            3.2.3  Unicode
            3.2.4  HZ (Hanzi) Code
        3.3  Entropy of Chinese and Other Languages
    Chapter 4  Huffman Coding on Chinese Text
        4.1  The Use of the Chinese Character Identification Routine
        4.2  Result
        4.3  Justification of the Result
        4.4  Time and Memory Resources Analysis
        4.5  The Heuristic Order-n Huffman Coding for Chinese Text Compression
            4.5.1  The Algorithm
            4.5.2  Result
            4.5.3  Justification of the Result
        4.6  Chapter Conclusion
    Chapter 5  The Ziv-Lempel Compression on Chinese Text
        5.1  The Chinese LZSS Compression
            5.1.1  The Algorithm
            5.1.2  Result
            5.1.3  Justification of the Result
            5.1.4  Time and Memory Resources Analysis
            5.1.5  Effects of Controlling the Parameters
        5.2  The Chinese LZW Compression
            5.2.1  The Algorithm
            5.2.2  Result
            5.2.3  Justification of the Result
            5.2.4  Time and Memory Resources Analysis
            5.2.5  Effects of Controlling the Parameters
        5.3  A Comparison of the Performance of the LZSS and the LZW
        5.4  Chapter Conclusion
    Chapter 6  Chinese Dictionary-based Huffman Coding
        6.1  The Algorithm
        6.2  Result
        6.3  Justification of the Result
        6.4  Effects of Changing the Size of the Dictionary
        6.5  Chapter Conclusion
    Chapter 7  Cascading of Huffman Coding and LZW Compression
        7.1  Static Cascading Model
            7.1.1  The Algorithm
            7.1.2  Result
            7.1.3  Explanation and Analysis of the Result
        7.2  Adaptive (Dynamic) Cascading Model
            7.2.1  The Algorithm
            7.2.2  Result
            7.2.3  Explanation and Analysis of the Result
        7.3  Chapter Conclusion
    Chapter 8  Concluding Remarks
        8.1  Conclusion
        8.2  Future Work Directions
            8.2.1  Improvement in Efficiency and Resource Consumption
            8.2.2  The Compressibility of Chinese and Other Languages
            8.2.3  Use of a Grammar Model
            8.2.4  Lossy Compression
        8.3  Epilogue
    Bibliography
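    The thesis defines entropy (section 1.5.2) and repeatedly contrasts byte-level with character-level models; under the Big5 scheme it describes, each Chinese character occupies two bytes. Purely as an illustrative sketch (the thesis gives no code, and the 0xA1-0xF9 lead-byte range is an assumption about the common Big5 block), the following Python computes the order-0 entropy of Big5 data with each double-byte code treated as a single symbol:

        import math
        from collections import Counter

        def big5_symbols(data):
            """Split a byte string into symbols, pairing Big5 double-byte codes.
            A lead byte in 0xA1-0xF9 is assumed to start a two-byte Chinese
            character; anything else is kept as a single byte."""
            i = 0
            while i < len(data):
                if 0xA1 <= data[i] <= 0xF9 and i + 1 < len(data):
                    yield bytes(data[i:i + 2])  # one Chinese character
                    i += 2
                else:
                    yield bytes(data[i:i + 1])  # ASCII or other single byte
                    i += 1

        def order0_entropy(data):
            """Shannon entropy in bits per symbol under an order-0 model."""
            counts = Counter(big5_symbols(data))
            total = sum(counts.values())
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

    Character-level symbol streams like this are the input that the character-based Huffman and Ziv-Lempel coders of chapters 4-7 operate on.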

    Compression of Spectral Images


    Lecture Notes on Network Information Theory

    These lecture notes have been converted into a book, Network Information Theory, recently published by Cambridge University Press. The book provides a significantly expanded exposition of the material in the lecture notes, as well as problems and bibliographic notes at the end of each chapter. The authors are currently preparing a set of slides based on the book, to be posted in the second half of 2012. More information about the book can be found at http://www.cambridge.org/9781107008731/. The previous (and now obsolete) version of the lecture notes can be found at http://arxiv.org/abs/1001.3404v4/

    Optimum Implementation of Compound Compression of a Computer Screen for Real-Time Transmission in Low Network Bandwidth Environments

    Remote working has become increasingly prevalent in recent times. A large part of remote working involves sharing computer screens between servers and clients. The image content presented when sharing a computer screen consists of both natural, camera-captured image data and computer-generated graphics and text. The attributes of natural camera-captured image data differ greatly from those of computer-generated image data. An image containing a mixture of both is known as a compound image. The research presented in this thesis focuses on the challenge of constructing a compound compression strategy that applies the ‘best fit’ compression algorithm to the mixed content found in a compound image. The research also involves analysing and classifying the types of data a given compound image may contain. While researching optimal types of compression, consideration is given to the computational overhead of a given algorithm, because the work targets real-time systems such as cloud computing services, where latency has a detrimental impact on the end-user experience. Previous and current state-of-the-art video codecs have been studied, along with many of the most recent publications from academia, in order to design and implement a novel low-complexity compound compression algorithm suitable for real-time transmission. The compound compression algorithm utilises a mixture of lossless and lossy compression algorithms, with parameters that can be used to control its performance. An objective image quality assessment is needed to determine whether the proposed algorithm can produce an acceptable-quality image after processing. A traditional metric, Peak Signal-to-Noise Ratio, is used alongside a more modern approach, the Structural Similarity Index, to define the quality of the decompressed image. Finally, the compression strategy is tested on a set of generated compound images. Using open-source software, the same images are compressed with previous and current state-of-the-art video codecs to compare the three main metrics: compression ratio, computational complexity and objective image quality.
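    The abstract names Peak Signal-to-Noise Ratio as its traditional quality metric. As a reference point only (not code from the thesis), the standard PSNR definition for 8-bit images, 10 * log10(255^2 / MSE), can be sketched in Python as:

        import numpy as np

        def psnr(reference, distorted, peak=255.0):
            """Peak Signal-to-Noise Ratio in dB between two equal-shape images."""
            ref = np.asarray(reference, dtype=np.float64)
            dist = np.asarray(distorted, dtype=np.float64)
            mse = np.mean((ref - dist) ** 2)
            if mse == 0:
                return float("inf")  # identical images
            return 10.0 * np.log10(peak ** 2 / mse)

    Higher values mean less distortion; because PSNR correlates imperfectly with perceived quality on text-heavy screen content, it is paired with the Structural Similarity Index.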

    The contour tree image encoding technique and file format

    The process of contourization is presented, which converts a raster image into a discrete set of plateaux or contours. These contours can be grouped into a hierarchical structure, defining total spatial inclusion, called a contour tree. A contour coder has been developed which fully describes these contours in a compact and efficient manner and is the basis for an image compression method. Simplification of the contour tree has been undertaken by merging contour tree nodes, thus lowering the contour tree's entropy. This can be exploited by the contour coder to increase the image compression ratio. By applying general and simple rules derived from physiological experiments on the human vision system, lossy image compression can be achieved which minimises noticeable artifacts in the simplified image. The contour merging technique offers a complementary lossy compression system to the QDCT (Quantised Discrete Cosine Transform). The artifacts introduced by the two methods are very different; QDCT produces a general blurring and adds extra highlights in the form of overshoots, whereas contour merging sharpens edges, reduces highlights and introduces a degree of false contouring. A format based on the contourization technique which caters for most image types is defined, called the contour tree image format. Image operations directly on this compressed format have been studied, and for certain manipulations these can offer significant speed increases over using a standard raster image format. A couple of examples of operations specific to the contour tree format are presented, showing some of the features of the new format.
    Science and Engineering Research Council
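    The abstract describes grouping a raster image into plateaux of equal value. As an illustrative sketch only (4-connectivity and exact intensity equality are assumptions here; the actual contourization method may differ), the following labels each maximal connected region of equal value, the raw material from which a contour tree's inclusion hierarchy is built:

        from collections import deque

        def plateaux(image):
            """Label 4-connected regions of equal pixel value ('plateaux').
            image: 2-D list of ints. Returns a same-shape array of region labels."""
            h, w = len(image), len(image[0])
            labels = [[-1] * w for _ in range(h)]
            next_label = 0
            for sy in range(h):
                for sx in range(w):
                    if labels[sy][sx] != -1:
                        continue  # already part of an earlier plateau
                    value = image[sy][sx]
                    labels[sy][sx] = next_label
                    queue = deque([(sy, sx)])
                    while queue:  # breadth-first flood fill
                        y, x = queue.popleft()
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                            if 0 <= ny < h and 0 <= nx < w and \
                                    labels[ny][nx] == -1 and image[ny][nx] == value:
                                labels[ny][nx] = next_label
                                queue.append((ny, nx))
                    next_label += 1
            return labels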

    Text Augmentation: Inserting markup into natural language text with PPM Models

    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
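    The abstract mentions "a variety of escape methods" in CEM's PPM models. CEM's own models are not reproduced here, but as one concrete instance, this sketch shows the widely used PPM "method C" estimate, in which the seen symbols and the escape event share a common denominator:

        from collections import Counter

        def ppmc_probabilities(counts):
            """PPM method C estimates for one context.
            counts: Counter mapping symbol -> frequency observed in this context.
            Returns (symbol_probs, escape_prob): each seen symbol s gets
            count[s] / (total + distinct); escaping to the next-shorter
            context gets distinct / (total + distinct)."""
            total = sum(counts.values())
            distinct = len(counts)
            denom = total + distinct
            return {s: c / denom for s, c in counts.items()}, distinct / denom

        # After seeing 'a' three times and 'b' once in some context:
        probs, p_escape = ppmc_probabilities(Counter({"a": 3, "b": 1}))
        # probs == {'a': 3/6, 'b': 1/6}, p_escape == 2/6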

    Lempel Ziv Welch data compression using associative processing as an enabling technology for real time application

    Data compression is a term that refers to the reduction of data representation requirements in storage and/or transmission. A commonly used algorithm for compression is the Lempel-Ziv-Welch (LZW) method proposed by Terry A. Welch [1]. LZW is an adaptive, dictionary-based, lossless algorithm. This provides a general compression mechanism that is applicable to a broad range of inputs. Furthermore, the lossless nature of LZW implies that it is a reversible process, so the original file/message is fully recoverable from the compressed data. A variant of this algorithm is currently the foundation of the UNIX compress program. Additionally, LZW is one of the compression schemes defined in the TIFF standard [12], as well as in the CCITT V.42bis standard. One of the challenges in designing an efficient compression mechanism such as LZW that can be used in real-time applications is the speed of the search into the data dictionary. In this paper an Associative Processing (ASP) implementation of the LZW algorithm is presented. This approach provides an efficient solution to this requirement. Additionally, it is shown that Associative Processing allows for a rapid and elegant development of the LZW algorithm that will generally outperform standard approaches in complexity, readability, and performance.
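    The paper's contribution is performing the dictionary search with associative (content-addressable) processing; the sketch below is only the textbook sequential LZW encoder that such hardware accelerates, with a Python dict standing in where ASP would match all dictionary entries in parallel:

        def lzw_compress(data):
            """Textbook LZW encoder: emits dictionary codes and grows the
            dictionary with each (previous string + next byte). Codes 0-255
            are reserved for the single bytes."""
            dictionary = {bytes([i]): i for i in range(256)}
            next_code = 256
            w = b""
            codes = []
            for byte in data:
                wc = w + bytes([byte])
                if wc in dictionary:
                    w = wc  # keep extending the current match
                else:
                    codes.append(dictionary[w])  # emit the longest match
                    dictionary[wc] = next_code   # learn the new string
                    next_code += 1
                    w = bytes([byte])
            if w:
                codes.append(dictionary[w])
            return codes

    Every lookup of wc is a search of the growing dictionary; the ASP approach replaces that search with a parallel compare across all stored strings, which is what makes real-time operation feasible.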

    Implementation of Image Compression Algorithm using Verilog with Area, Power and Timing Constraints

    Image compression is the application of data compression to digital images. A fundamental shift in the image compression approach came after the Discrete Wavelet Transform (DWT) became popular. To overcome the inefficiencies in the JPEG standard and serve emerging areas of mobile and Internet communications, the new JPEG2000 standard was developed based on the principles of the DWT. An image compression algorithm was first prototyped in MATLAB, then modified to perform better when implemented in a hardware description language. The encoder for DWT-based image compression was implemented in Verilog HDL. Detailed analysis of power, timing and area was done for the Booth multiplier, which forms the major building block in implementing the DWT. The encoding technique exploits the zerotree structure present in the bitplanes to compress the transform coefficients.
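    JPEG2000 itself specifies the 5/3 and 9/7 lifting filters rather than the Haar wavelet; the Haar step below is shown only as the simplest DWT instance, to make concrete the averaging/differencing structure that the hardware multipliers implement. An illustrative sketch, not the thesis's design:

        def haar_dwt_1d(signal):
            """One level of the 1-D Haar DWT: pairwise averages (approximation)
            and pairwise differences (detail). Assumes an even-length input."""
            approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
            detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
            return approx, detail

    A 2-D transform applies this to every row and then every column; the coarse approximation band concentrates most of the energy, leaving detail bitplanes full of near-zero coefficients, which is the structure that zerotree encoding exploits.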

    Digital image compression
