Reducing the Loss of Information through Annealing Text Distortion
Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F., "Reducing the Loss of Information through Annealing Text Distortion," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.

Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by experimentally evaluating the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words, so that the complexity of a document is slowly reduced, helps compression-based text clustering and improves its accuracy. In fact, we show that the clustering of nondistorted text can be improved by means of annealing text distortion. The experimental results are consistent across different data sets and different compression algorithms from the most important compression families: Lempel-Ziv, statistical, and block-sorting.

This work was supported by the Spanish Ministry of Education and Science under projects TIN2010-19872 and TIN2010-19607.
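The abstract does not restate the distance it builds on, but the standard parameter-free choice in this literature is the normalized compression distance (NCD). A minimal sketch in Python, with zlib standing in for the compressor (the compressor choice here is an assumption; the paper evaluates several compression families):

    import zlib

    def csize(data: bytes) -> int:
        # Compressed size under zlib, a Lempel-Ziv family compressor.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy, cxy = csize(x), csize(y), csize(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Similar documents compress well together, giving a small distance.
    doc1 = b"the quick brown fox jumps over the lazy dog " * 20
    doc2 = b"the quick brown fox leaps over the lazy cat " * 20
    print(ncd(doc1, doc2))  # near 0 for related texts, near 1 for unrelated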
Performance Analysis of Multimedia Compression Algorithms
In this paper, we evaluate the performance of the Huffman and Run-Length Encoding compression algorithms on multimedia data, using different formats such as images and text. Extensive experimentation with different file sizes was used to compare both algorithms in terms of compression ratio and compression time. The Huffman algorithm showed more consistent performance than Run-Length Encoding.
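As a rough sketch of the kind of measurement being compared, the following Python shows a minimal run-length encoder and the resulting compression ratio (the byte-pair output format and the ratio definition are assumptions, not details from the paper):

    def rle_encode(data: bytes) -> bytes:
        # Emit (count, byte) pairs; runs are capped at 255 to fit one byte.
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out.extend((run, data[i]))
            i += run
        return bytes(out)

    raw = b"aaaaabbbccccccccd"
    enc = rle_encode(raw)
    # Compression ratio as compressed size / original size (lower is better).
    print(len(enc) / len(raw))  # 8/17, about 0.47

On input with few runs, an encoder like this can even expand the data, which is consistent with Huffman's more consistent behaviour across file types.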
On the Use of Suffix Arrays for Memory-Efficient Lempel-Ziv Data Compression
Much research has been devoted to optimizing algorithms of the Lempel-Ziv (LZ) 77 family, both in terms of speed and memory requirements. Binary search trees and suffix trees (ST) are data structures that have often been used for this purpose, as they allow fast searches at the expense of memory usage. In recent years, there has been interest in suffix arrays (SA), due to their simplicity and low memory requirements. One key point is that an SA can solve the sub-string problem almost as efficiently as an ST, using less memory. This paper proposes two new SA-based algorithms for LZ encoding, which require no modifications on the decoder side. Experimental results on standard benchmarks show that our algorithms, though not faster, use 3 to 5 times less memory than the ST counterparts. Another important feature of our SA-based algorithms is that the amount of memory is independent of the text to search, so the memory that has to be allocated can be defined a priori. These features of low and predictable memory requirements are of the utmost importance in several scenarios, such as embedded systems, where memory is at a premium and speed is not critical. Finally, we point out that the new algorithms are general, in the sense that they are suitable for applications other than LZ compression, such as text retrieval and forward/backward sub-string search.

Comment: 10 pages, submitted to IEEE - Data Compression Conference 200
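The encoder itself is not reproduced here, but the property the paper leans on, namely that a sorted array of suffix start positions answers sub-string queries with plain binary search, can be sketched in a few lines of Python (the naive quadratic SA construction is for illustration only; practical encoders use efficient construction algorithms):

    import bisect

    def suffix_array(text: str) -> list[int]:
        # Naive construction: sort suffix start positions by the suffix itself.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def occurrences(text: str, sa: list[int], pat: str) -> list[int]:
        # Suffixes starting with `pat` form one contiguous block of the SA;
        # two binary searches locate it (requires Python 3.10+ for key=).
        key = lambda i: text[i:i + len(pat)]
        lo = bisect.bisect_left(sa, pat, key=key)
        hi = bisect.bisect_right(sa, pat, key=key)
        return sorted(sa[lo:hi])

    text = "abracadabra"
    print(occurrences(text, suffix_array(text), "abra"))  # [0, 7]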
Comparison of the LZ77, Huffman, and Deflate Methods for Text Data Compression (Perbandingan Metode LZ77, Metode Huffman dan Metode Deflate terhadap Kompresi Data Teks)
Data compression is a very important process in a world that relies heavily on digital files, whether texts, images, sounds, or videos. These files vary in size and often take up considerable disk storage space. To address this problem, many compression algorithms have been created, both lossy and lossless. This research tests four lossless compression schemes applied to text files: LZ77, static Huffman, LZ77 combined with static Huffman, and Deflate. The performance of the four schemes is compared by measuring the compression ratio. From the test results it can be concluded that Deflate is the best algorithm, owing to its use of multiple modes: an uncompressed mode, LZ77 combined with static Huffman coding, and LZ77 combined with dynamic Huffman coding. The results also show that Deflate compresses text files with an average compression ratio of 38.84%.
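Deflate is precisely the LZ77-plus-Huffman combination described above and is exposed by standard libraries, so the measurement is easy to reproduce in spirit. A small Python sketch (taking compression ratio to mean compressed size over original size is an assumption about how the 38.84% figure was computed):

    import zlib

    def deflate_ratio(data: bytes) -> float:
        # zlib's stream is Deflate (LZ77 + Huffman) plus a small header/checksum.
        return len(zlib.compress(data, 9)) / len(data)

    sample = ("Data compression is a very important process. " * 200).encode()
    # Highly repetitive input compresses far better than typical prose, so
    # this ratio will be much lower than the ~39% reported for real text files.
    print(f"{deflate_ratio(sample):.2%}")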
Lempel-Ziv Parsing in External Memory
For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.

Comment: 10 pages
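The external-memory algorithm is beyond a short sketch, but the object it computes is easy to state: the greedy LZ77 factorization of the text. A naive in-memory reference version in Python (quadratic time; the paper's contribution is computing the same parsing when the text does not fit in RAM):

    def lz77_factorize(text: str) -> list:
        # Greedy parsing: each factor is the longest prefix of the remaining
        # text that already occurs earlier, or a single fresh character.
        factors = []
        i = 0
        while i < len(text):
            best_len, best_pos = 0, -1
            for j in range(i):
                l = 0
                while i + l < len(text) and text[j + l] == text[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_pos = l, j
            if best_len == 0:
                factors.append(("literal", text[i]))
                i += 1
            else:
                factors.append((best_pos, best_len))  # (source position, length)
                i += best_len
        return factors

    print(lz77_factorize("abababab"))
    # [('literal', 'a'), ('literal', 'b'), (0, 6)] -- note a factor may
    # overlap its own source, as in standard (self-referential) LZ77.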