
    Universal lossless source coding with the Burrows Wheeler transform

    The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
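    As a concrete illustration of the transform these codes build on, the following is a minimal sketch of the forward and inverse BWT (naive rotation sorting with an assumed end-of-string sentinel; practical implementations use suffix-array-based constructions instead).

```python
def bwt_forward(s: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of s + sentinel
    and output the last column."""
    assert sentinel not in s
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def bwt_inverse(bwt: str, sentinel: str = "\0") -> str:
    """Naive inversion: repeatedly prepend the transform column and re-sort,
    then read off the row that ends with the sentinel."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

# round trip on a small example
assert bwt_inverse(bwt_forward("banana")) == "banana"
```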

    Lossless Compression of Color Palette Images with One-Dimensional Techniques

    Palette images are widely used on the World Wide Web (WWW) and in game-cartridge applications. Many images used on the WWW are stored and transmitted after they are compressed losslessly with the standard graphics interchange format (GIF) or portable network graphics (PNG). Well-known 2-D compression schemes, such as JPEG-LS and JPEG-2000, fail to yield better compression than GIF or PNG because the pixel values represent indices that point to color values in a look-up table. To improve the compression performance of JPEG-LS and JPEG-2000, several researchers have proposed various reindexing algorithms. We investigate various compression techniques for color palette images. We propose a new technique composed of a traveling salesman problem (TSP)-based reindexing scheme, Burrows-Wheeler transformation, and inversion ranks. We show that the proposed technique yields better compression gain on average than all the other 1-D compressors and the reindexing schemes that utilize JPEG-LS or JPEG-2000.
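    The reindexing step can be illustrated with a much simpler stand-in for the paper's TSP formulation: a greedy nearest-neighbour ordering of palette entries by how often they occur in adjacent pixels. The co-occurrence cost model, the greedy tour, and the tiny example below are assumptions for illustration, not the authors' algorithm, and the BWT and inversion-rank stages are not reproduced here.

```python
from collections import Counter

def cooccurrence(indices, width):
    """Count how often each pair of palette indices appears in horizontally
    adjacent pixels of a row-major index stream (an assumed cost model)."""
    counts = Counter()
    for i in range(len(indices) - 1):
        if (i + 1) % width != 0:            # skip the wrap-around at row ends
            a, b = indices[i], indices[i + 1]
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def greedy_reindex(palette_size, counts):
    """Greedy nearest-neighbour tour over palette entries: colours that often
    sit next to each other receive numerically close indices, which tends to
    smooth the 1-D index stream before BWT/entropy coding."""
    remaining = set(range(palette_size))
    order = [min(remaining)]                # arbitrary start colour
    remaining.remove(order[0])
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda c: counts.get((last, c), 0))
        order.append(nxt)
        remaining.remove(nxt)
    return {old: new for new, old in enumerate(order)}   # old index -> new index

# usage: remap a tiny 4x2 index image with 3 palette entries
img = [0, 2, 2, 0, 1, 2, 2, 1]
mapping = greedy_reindex(3, cooccurrence(img, width=4))
remapped = [mapping[i] for i in img]
```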

    Universal lossless source coding with the Burrows Wheeler transform

    We here consider a theoretical evaluation of data compression algorithms based on the Burrows Wheeler transform (BWT). The main contributions include a variety of very simple new techniques for BWT-based universal lossless source coding on finite-memory sources and a set of new rate-of-convergence results for BWT-based source codes. The result is a theoretical validation and quantification of the earlier experimental observation that BWT-based lossless source codes give performance better than that of Ziv-Lempel-style codes and almost as good as that of prediction by partial matching (PPM) algorithms.

    Reducing the Space Requirement of Suffix Trees

    We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space-efficient representations. The most space-efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
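    The savings come from how each tree node is laid out in memory. The sketch below is not the paper's representation; it is a naive first-child/next-sibling suffix trie that makes the per-node link cost visible (real suffix trees collapse unary paths and are built in linear time, e.g. with Ukkonen's or McCreight's algorithm).

```python
class Node:
    """One trie node in a first-child/next-sibling layout: two links per node
    instead of a full child array, the kind of per-node saving that more
    elaborate suffix-tree encodings push much further."""
    __slots__ = ("edge_char", "first_child", "next_sibling")

    def __init__(self, edge_char):
        self.edge_char = edge_char
        self.first_child = None
        self.next_sibling = None

def insert_suffix(root, suffix):
    """Walk/extend the trie one character at a time along the suffix."""
    node = root
    for ch in suffix:
        child = node.first_child
        while child is not None and child.edge_char != ch:
            child = child.next_sibling
        if child is None:                      # add a new child for ch
            child = Node(ch)
            child.next_sibling = node.first_child
            node.first_child = child
        node = child

def build_suffix_trie(text):
    """O(n^2) construction of an (uncompressed) suffix trie, for illustration
    only; it stores every suffix of text terminated by '$'."""
    root = Node(None)
    for i in range(len(text)):
        insert_suffix(root, text[i:] + "$")
    return root

root = build_suffix_trie("banana")
```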

    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

    In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to another over data communication links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes a single site may contain large collections of data, such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as DTDs or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. For efficiency in terms of both space and time, there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternate representations that take less space. Not only does it help conserve storage space for archival and online data, it also improves system performance by requiring fewer secondary storage (disk or CD-ROM) accesses and improves network bandwidth utilization by reducing transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz and .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, fingerprint data, astronomical images and databases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to effectively utilize the full potential of compression techniques in future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or with only partial decompression, independent of whether the search is done on the text or on some inversion table corresponding to a set of keywords for the text.

    In this dissertation, we make the following contributions. (1) Star family compression algorithms: We propose an approach to develop a reversible transformation that can be applied to a source text to improve existing algorithms' ability to compress. We use a static dictionary to convert English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text, so some compression is already achieved at the preprocessing stage. We present a series of transforms which improve the performance; the star transform requires a static dictionary of a certain size. To avoid the considerable complexity of conversion, we employ a ternary tree data structure that converts the words in the text to the words in the star dictionary in linear time.

    (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We propose a method to extract the useful context information in linear time from the BWT-transformed text. The auxiliary arrays obtained from the BWT inverse transform yield logarithmic search time. Approximate pattern matching can then be performed based on the results of exact pattern matching, extracting candidates that are typically only small parts of the original text, to which a fast verification algorithm is applied. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT-compressed text. A typical compression system based on the BWT has Move-to-Front and Huffman coding stages after the transformation (a generic Move-to-Front coder is sketched after this abstract). We propose a novel approach to replace the Move-to-Front stage in order to extend compressed-domain search capability all the way to the entropy coding stage. A modification to the Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point.

    (3) A modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratios and/or time complexity, LZW was the first one studied for compressed pattern matching because of its simplicity and efficiency. Our modifications to the LZW algorithm provide fast random access and partial decoding, which are especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the search can be performed at the expected level of granularity; for example, the user can choose to retrieve a single line, a paragraph, or a whole file that contains the keywords. More importantly, we show that parallel encoding and decoding are straightforward with the modified LZW: both can be performed easily with multiple processors, and the encoding and decoding processes are independent of the number of processors.
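    The Move-to-Front stage named above is simple enough to sketch; the following is a generic MTF coder over bytes, not the modified variant proposed in the dissertation.

```python
def mtf_encode(data: bytes) -> list[int]:
    """Move-to-Front: emit each byte's position in a recency list, then move
    that byte to the front. Runs of equal symbols after the BWT become runs
    of zeros, which the entropy coder exploits."""
    alphabet = list(range(256))
    out = []
    for b in data:
        pos = alphabet.index(b)
        out.append(pos)
        alphabet.pop(pos)
        alphabet.insert(0, b)
    return out

def mtf_decode(ranks: list[int]) -> bytes:
    """Invert MTF by replaying the same recency-list updates."""
    alphabet = list(range(256))
    out = bytearray()
    for pos in ranks:
        b = alphabet.pop(pos)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)

assert mtf_decode(mtf_encode(b"bananaaa")) == b"bananaaa"
```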

    Tree models: algorithms and information theoretic properties

    The thesis studies fundamental properties of, and algorithms related to, tree models. These models require a relatively small number of parameters to represent finite-memory (Markov) sources over finite alphabets when the number of past symbols needed to determine the conditional probability distribution of the next symbol is not fixed, but depends on the context in which the symbol occurs. The thesis defines combinatorial structures such as generalized context trees and their FSM (finite state machine) closures, and applies these structures to describe the first linear-time implementation of encoding and decoding for the semi-predictive version of the Context algorithm, a twice-universal scheme that attains an optimal rate of convergence to the entropy over the class of tree models. The thesis then analyzes type classes for tree models, extending the method of types previously studied for FSM models. An exact formula is derived for the cardinality of the type class of a given sequence of length n, together with an asymptotic estimate of the expected value of the logarithm of the size of a type class and an asymptotic estimate of the number of distinct type classes for sequences of a given length. These asymptotic results are derived with the help of the new concept of the minimal canonical extension of a context tree, a fundamental combinatorial object that lies between the original tree and its FSM closure. As applications of the newly discovered properties of tree models, twice-universal enumerative coding algorithms and universal simulation schemes for individual sequences are presented. Finally, the thesis presents some open problems and directions for future research in this area.
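    To make the idea of a variable-length context concrete, here is a minimal illustration (a toy binary model, not the thesis's algorithms or data structures) of selecting the conditional distribution of the next symbol by walking a context tree backwards through the past.

```python
# A context tree maps variable-length suffixes of the past to conditional
# distributions over the next symbol. Toy example: one symbol of memory is
# enough after a 1, but two symbols are needed after a 0.
tree = {
    "1":  {"0": 0.3, "1": 0.7},
    "00": {"0": 0.9, "1": 0.1},
    "10": {"0": 0.5, "1": 0.5},
}

def next_symbol_distribution(past: str, tree: dict) -> dict:
    """Extend the candidate context backwards one symbol at a time until it
    matches a leaf of the context tree, then return that leaf's distribution."""
    context = ""
    for symbol in reversed(past):
        context = symbol + context
        if context in tree:
            return tree[context]
    raise ValueError("past does not reach any leaf of the context tree")

# the context of "0110" is "10" (two symbols needed after a 0);
# the context of "0011" is just "1"
assert next_symbol_distribution("0110", tree) == {"0": 0.5, "1": 0.5}
assert next_symbol_distribution("0011", tree) == {"0": 0.3, "1": 0.7}
```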

    Using semantic knowledge to improve compression on log files

    With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produces log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, many other compression programs exist, each with their own advantages and disadvantages. These programs each use a different amount of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually follow a similar format with a defined syntax: not all ASCII characters are used, and the messages contain certain "phrases" which are often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying the results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction, producing up to 98 percent reduction in file size on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size.
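    As an illustration of the kind of semantic preprocessing described, here is a minimal sketch that replaces dotted-quad IPv4 addresses with an escape byte followed by their 4-byte binary form before a general-purpose compressor is applied. The escape byte, the token format, and the sample log line are assumptions for the example, not the thesis's actual encoding.

```python
import re
import socket

ESCAPE = b"\x01"   # assumed to be a byte value unused in the log corpus
IPV4 = re.compile(rb"(?:\d{1,3}\.){3}\d{1,3}")

def pack_ips(line: bytes) -> bytes:
    """Replace each dotted-quad IPv4 address with ESCAPE plus its 4-byte
    binary form, shrinking and regularising the text before a compressor
    such as gzip or bzip2 sees it."""
    def repl(m):
        try:
            return ESCAPE + socket.inet_aton(m.group().decode())
        except OSError:              # e.g. 999.1.1.1 is not a valid address
            return m.group()
    return IPV4.sub(repl, line)

def unpack_ips(line: bytes) -> bytes:
    """Undo pack_ips by expanding each escaped 4-byte address back to text."""
    out = bytearray()
    i = 0
    while i < len(line):
        if line[i:i + 1] == ESCAPE:
            out += socket.inet_ntoa(line[i + 1:i + 5]).encode()
            i += 5
        else:
            out.append(line[i])
            i += 1
    return bytes(out)

# hypothetical maillog line used only to exercise the round trip
line = b"Oct  3 12:00:01 relay sm-mta: from=<a@b>, relay=192.168.10.7"
assert unpack_ips(pack_ips(line)) == line
```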