
    Reducing the loss of information through annealing text distortion

    Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F., "Reducing the Loss of Information through Annealing Text Distortion," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.

    Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words, in such a way that the complexity of a document is slowly reduced, helps compression-based text clustering and improves its accuracy. In fact, we show how clustering of the nondistorted text can be improved by means of annealing text distortion. The experimental results presented in this paper are consistent across different data sets and different compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting.

    This work was supported by the Spanish Ministry of Education and Science under projects TIN2010-19872 and TIN2010-19607.
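
    As background for the compression distances discussed above, the sketch below computes the Normalized Compression Distance (NCD), the parameter-free similarity measure that compression-based clustering typically builds on. It uses Python's bz2 as a stand-in compressor and toy strings; the paper's actual pipeline, data sets, and annealing distortion procedure are not reproduced here.

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a real-world compressor."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy illustration: related documents should score lower (closer) than unrelated ones.
doc_a = b"compression distances are parameter-free and widely applicable"
doc_b = b"compression distances are effective in knowledge discovery and data mining"
doc_c = b"the quick brown fox jumps over the lazy dog"
print(ncd(doc_a, doc_b))  # expected to be smaller ...
print(ncd(doc_a, doc_c))  # ... than this (though NCD is noisy on very short strings)
```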

    Performance Analysis of Multimedia Compression Algorithms

    In this paper, we evaluate the performance of the Huffman and Run-Length Encoding compression algorithms on multimedia data. We used different types of multimedia formats, such as images and text. Extensive experimentation with different file sizes was used to compare both algorithms in terms of compression ratio and compression time. The Huffman algorithm showed consistent performance compared to Run-Length Encoding.
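
    As an illustration of the kind of measurement described above, the sketch below implements a naive run-length encoder together with a compression-ratio helper and a simple timing call. The paper's file formats, its exact ratio definition, and its timing methodology are not specified here, so this is only a minimal example.

```python
import time

def rle_encode(data: bytes) -> bytes:
    """Naive run-length encoding: each run becomes a (count, byte) pair, count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Compressed size as a fraction of the original size (lower is better)."""
    return len(compressed) / len(original)

sample = b"AAAAABBBCCCCCCCCDAB"
start = time.perf_counter()
encoded = rle_encode(sample)
elapsed = time.perf_counter() - start
print(len(sample), len(encoded), compression_ratio(sample, encoded), elapsed)
```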

    On the Use of Suffix Arrays for Memory-Efficient Lempel-Ziv Data Compression

    Much research has been devoted to optimizing algorithms of the Lempel-Ziv (LZ) 77 family, both in terms of speed and memory requirements. Binary search trees and suffix trees (ST) are data structures that have often been used for this purpose, as they allow fast searches at the expense of memory usage. In recent years, there has been interest in suffix arrays (SA), due to their simplicity and low memory requirements. One key point is that an SA can solve the sub-string search problem almost as efficiently as an ST, using less memory. This paper proposes two new SA-based algorithms for LZ encoding, which require no modifications on the decoder side. Experimental results on standard benchmarks show that our algorithms, though not faster, use 3 to 5 times less memory than the ST counterparts. Another important feature of our SA-based algorithms is that the amount of memory is independent of the text to be searched, so the memory that has to be allocated can be defined a priori. These features of low and predictable memory requirements are of the utmost importance in several scenarios, such as embedded systems, where memory is at a premium and speed is not critical. Finally, we point out that the new algorithms are general, in the sense that they are adequate for applications other than LZ compression, such as text retrieval and forward/backward sub-string search.

    Comment: 10 pages, submitted to IEEE Data Compression Conference 200
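
    The abstract above leans on the fact that a suffix array supports fast sub-string search with a small memory footprint. The sketch below is a textbook illustration of that idea (naive construction plus binary search over suffixes); it is not the paper's SA-based LZ match finder.

```python
def build_suffix_array(text: str) -> list[int]:
    """Suffix array: starting positions of all suffixes, sorted lexicographically.
    Naive O(n^2 log n) construction for clarity; practical builders run in O(n log n) or O(n)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Binary-search the suffix array for all suffixes beginning with `pattern`."""
    m = len(pattern)

    def first_position(p: str, strict: bool) -> int:
        # First SA position whose m-character prefix is >= p (or > p when strict).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < p or (strict and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_position(pattern, strict=False)
    end = first_position(pattern, strict=True)
    return sorted(sa[start:end])

text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))  # [0, 7]
```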

    Perbandingan Metode LZ77, Metode Huffman dan Metode Deflate terhadap Kompresi Data Teks (Comparison of the LZ77, Huffman, and Deflate Methods for Text Data Compression)

    Data compression is a very important process in a world that makes extensive use of digital files, such as texts, images, sounds, and videos. These digital files vary in size and often take up considerable disk storage space. To address this problem, many researchers have created compression algorithms, both lossy and lossless. This research tests four lossless compression algorithms applied to text files: LZ77, static Huffman, LZ77 combined with static Huffman, and Deflate. The performance of the four algorithms is compared by measuring the compression ratio. From the test results it can be concluded that Deflate is the best of the four algorithms, owing to its use of multiple modes, i.e. an uncompressed mode, an LZ77 combined with static Huffman coding mode, and an LZ77 combined with dynamic Huffman coding mode. The results also show that the Deflate algorithm can compress text files with an average compression ratio of 38.84%.
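
    Since Deflate (LZ77 plus Huffman coding) is exposed by Python's standard zlib module, a minimal way to reproduce this kind of ratio measurement on one's own text files is sketched below. The file name is a placeholder, and the ratio is computed here as space saved in percent, which may differ from the definition used in the paper.

```python
import zlib

def deflate_space_saved(data: bytes, level: int = 9) -> float:
    """Percentage of space saved by DEFLATE: (1 - compressed / original) * 100."""
    compressed = zlib.compress(data, level)  # zlib wraps DEFLATE (LZ77 + Huffman coding)
    return (1 - len(compressed) / len(data)) * 100

# "sample.txt" is an illustrative placeholder for any text file to be measured.
with open("sample.txt", "rb") as f:
    data = f.read()
print(f"space saved: {deflate_space_saved(data):.2f}%")
```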

    Lempel-Ziv Parsing in External Memory

    For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.

    Comment: 10 pages
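
    For readers unfamiliar with the term, the sketch below shows what an LZ77 parsing (factorization) produces, using a deliberately naive quadratic, in-memory scan. The paper's contribution, computing this parsing in external memory for inputs larger than RAM, is not attempted here.

```python
def lz77_factorize(text):
    """Naive in-memory LZ77 factorization: each phrase is either a new literal character
    or a (start, length) reference to its longest earlier occurrence. O(n^2) worst case."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # try every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append(text[i])          # literal: character not seen before this position
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors

print(lz77_factorize("abababa"))  # ['a', 'b', (0, 5)]
```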