372 research outputs found

    Complexity-entropy analysis at different levels of organization in written language

    Full text link
    Written language is complex. A written text can be considered an attempt to convey a meaningful message which ends up being constrained by language rules, context dependence and highly redundant in its use of resources. Despite all these constraints, unpredictability is an essential element of natural language. Here we present the use of entropic measures to assert the balance between predictability and surprise in written text. In short, it is possible to measure innovation and context preservation in a document. It is shown that this can also be done at the different levels of organization of a text. The type of analysis presented is reasonably general, and can also be used to analyze the same balance in other complex messages such as DNA, where a hierarchy of organizational levels are known to exist

    Bicriteria data compression

    Get PDF
    The advent of massive datasets (and the consequent design of high-performing distributed storage systems) have reignited the interest of the scientific and engineering community towards the design of lossless data compressors which achieve effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed versus compressed-space efficiency. Each of the existing implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves by picking the one which comes closer to the requirements of the application in their hands. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading optimally, and in a principled way, the consumption of these two resources by introducing the Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data-compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice-versa). This way, the software engineer can set its space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space within a small, additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77-parsings of the input file. The preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory&practice

    Indexing Highly Repetitive String Collections

    Full text link
    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field

    Document Retrieval on Repetitive Collections

    Full text link
    Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa

    Reducing the loss of information through annealing text distortion

    Full text link
    Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Granados, A. ;Cebrian, M. ; Camacho, D. ; de Borja Rodriguez, F. "Reducing the Loss of Information through Annealing Text Distortion". IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7 pp. 1090 - 1102, July 2011Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the nondistorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent using different data sets, and different compression algorithms belonging to the most important compression families: Lempel-Ziv, Statistical and Block-Sorting.This work was supported by the Spanish Ministry of Education and Science under TIN2010-19872 and TIN2010-19607 projects

    Compressed Text Indexes:From Theory to Practice!

    Full text link
    A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology

    Optimum Implementation of Compound Compression of a Computer Screen for Real-Time Transmission in Low Network Bandwidth Environments

    Get PDF
    Remote working is becoming increasingly more prevalent in recent times. A large part of remote working involves sharing computer screens between servers and clients. The image content that is presented when sharing computer screens consists of both natural camera captured image data as well as computer generated graphics and text. The attributes of natural camera captured image data differ greatly to the attributes of computer generated image data. An image containing a mixture of both natural camera captured image and computer generated image data is known as a compound image. The research presented in this thesis focuses on the challenge of constructing a compound compression strategy to apply the ‘best fit’ compression algorithm for the mixed content found in a compound image. The research also involves analysis and classification of the types of data a given compound image may contain. While researching optimal types of compression, consideration is given to the computational overhead of a given algorithm because the research is being developed for real time systems such as cloud computing services, where latency has a detrimental impact on end user experience. The previous and current state of the art videos codec’s have been researched along many of the most current publishing’s from academia, to design and implement a novel approach to a low complexity compound compression algorithm that will be suitable for real time transmission. The compound compression algorithm will utilise a mixture of lossless and lossy compression algorithms with parameters that can be used to control the performance of the algorithm. An objective image quality assessment is needed to determine whether the proposed algorithm can produce an acceptable quality image after processing. Both traditional metrics such as Peak Signal to Noise Ratio will be used along with a new more modern approach specifically designed for compound images which is known as Structural Similarity Index will be used to define the quality of the decompressed Image. In finishing, the compression strategy will be tested on a set of generated compound images. Using open source software, the same images will be compressed with the previous and current state of the art video codec’s to compare the three main metrics, compression ratio, computational complexity and objective image quality

    Indexes and Computation over Compressed Structured Data (Dagstuhl Seminar 13232)

    Get PDF
    This report documents the program and the outcomes of Dagstuhl Seminar 13232 "Indexes and Computation over Compressed Structured Data"

    Information Systems: Secure Access and Storage in the Age of Cloud Computing

    Get PDF
    Given that cloud computing is a remotely accessed service, the connection between provider and customer needs to be adequately protected against all known security risks. In order to ensure this, an open and clear specification of all standards, algorithms and security protocols adopted by the cloud provider is required. In this paper, we review current issues concerned with security threats to cloud computing and present a solution based on our unique patented compression-encryption method. The method provides highly efficient data compression where a unique symmetric key is generated as part of the compression process and is dependent on the characteristics of the data. Without the key, the data cannot be decompressed. We focus on threat prevention by cryptography that, if properly implemented, is virtually impossible to break directly. Our security by design is based on two principles: first, defence in depth, where our proposed design is such that more than one subsystem needs to be violated to get both the data and their key. Second, the principle of least privilege, where the attacker may gain access to only part of a system. The paper highlights the benefits of the solution that include high compression ratios, less bandwidth requirements, faster data transmission and response times, less storage space, and less energy consumption among others

    Compression of Textual Column-Oriented Data

    Get PDF
    Column-oriented data are well suited for compression. Since values of the same column are stored contiguously on disk, the information entropy is lower if compared to the physical data organization of conventional databases. There are many useful light-weight compression techniques targeted at specific data types and domains, like integers and small lists of distinct values, respectively. However, compression of textual values formed by skewed and high-cardinality words is usually restricted to variations of the LZ compression algorithm. So far there are no empirical evaluations that verify how other sophisticated compression methods address columnar data that store text. In this paper we shed a light on this subject by revisiting concepts of those algorithms. We also analyse how they behave in terms of compression and speed when dealing with textual columns where values appear in adjacent positions
    • …
    corecore