
    A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

    The development of efficient data compressors for DNA sequences is crucial not only for reducing storage and transmission bandwidth, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitution-tolerant context models) and weighted stochastic repeat models. Both classes use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available under the GPLv3 license.
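    The competitive-selection idea can be sketched with two plain adaptive context models of different orders standing in for the paper's two model classes (a minimal sketch; the actual method combines weighted context models, substitution-tolerant models, and stochastic repeat models with inverted-repeat handling):

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def context_probs(counts, ctx, alpha=1.0):
    """Laplace-smoothed next-symbol distribution for one context."""
    c = counts[ctx]
    total = sum(c.values()) + len(ALPHABET) * alpha
    return {s: (c.get(s, 0) + alpha) / total for s in ALPHABET}

def competitive_code_length(seq, orders=(2, 8), gamma=0.99):
    """Bits needed to encode seq when, for each symbol, the model with
    the better (exponentially decayed) recent log-loss is selected
    before (conceptual) arithmetic coding."""
    models = [defaultdict(dict) for _ in orders]
    losses = [0.0] * len(orders)   # decayed log-loss per model class
    bits = 0.0
    for i, sym in enumerate(seq):
        ctxs = [seq[max(0, i - k):i] for k in orders]
        dists = [context_probs(m, c) for m, c in zip(models, ctxs)]
        best = min(range(len(orders)), key=lambda j: losses[j])
        bits += -math.log2(dists[best][sym])
        for j, (ctx, dist) in enumerate(zip(ctxs, dists)):
            losses[j] = gamma * losses[j] - (1 - gamma) * math.log2(dist[sym])
            models[j][ctx][sym] = models[j][ctx].get(sym, 0) + 1
    return bits
```

    For each symbol, the class whose decayed log-loss is currently lower supplies the coding distribution, mimicking the per-symbol competition before arithmetic encoding.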

    Prediction and evaluation of zero order entropy changes in grammar-based codes

    The change of zero-order entropy is studied over different strategies of grammar production rule selection. Two major classes of rules are distinguished: transformations that leave the message size intact and substitution functions that change the message size. Relations for zero-order entropy changes are derived for both cases, and the conditions under which the entropy decreases are described. In this article, several greedy strategies that reduce zero-order entropy as well as message size are summarized, and a new strategy, MinEnt, is proposed. The resulting evolution of the zero-order entropy is compared with the strategy of selecting the most frequent digram used in the Re-Pair algorithm.
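    A minimal sketch of one digram-substitution step and its effect on zero-order entropy (counting overlapping digram occurrences for simplicity, unlike a full Re-Pair implementation):

```python
import math
from collections import Counter

def zero_order_entropy(msg):
    """Empirical zero-order entropy of msg, in bits per symbol."""
    n = len(msg)
    return -sum(c / n * math.log2(c / n) for c in Counter(msg).values())

def replace_most_frequent_digram(msg, new_symbol):
    """One Re-Pair-style step: substitute the most frequent digram
    with a fresh symbol, shrinking the message size."""
    digrams = Counter(msg[i:i + 2] for i in range(len(msg) - 1))
    best, _ = digrams.most_common(1)[0]
    return msg.replace(best, new_symbol), best
```

    Replacing the most frequent digram of "abababab" with a fresh symbol shrinks the message to "XXXX" and drops its zero-order entropy from 1 bit/symbol to 0.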

    Optimal LZ-End Parsing Is Hard

    LZ-End is a variant of the well-known Lempel-Ziv parsing family in which each phrase of the parsing has a previous occurrence, with the additional constraint that the previous occurrence must end at the end of a previous phrase. LZ-End was initially proposed as a greedy parsing, where each phrase is determined greedily from left to right as the longest factor that satisfies the above constraint [Kreft & Navarro, 2010]. In this work, we consider an optimal LZ-End parsing, i.e., one that has the minimum number of phrases among such parsings. We show that a decision version of computing the optimal LZ-End parsing is NP-complete via a reduction from the vertex cover problem. Moreover, we give a MAX-SAT formulation for the optimal LZ-End parsing, adapting an approach for computing various NP-hard repetitiveness measures recently presented by [Bannai et al., 2022]. We also consider the approximation ratio of the size of the greedy LZ-End parsing to the size of the optimal LZ-End parsing, and give a lower bound on this ratio that asymptotically approaches 2.
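    The greedy parsing of [Kreft & Navarro, 2010] can be sketched naively (cubic time; practical implementations use compressed indexes): each phrase copies the longest factor that ends exactly at an earlier phrase boundary, then appends one literal character.

```python
def lz_end_parse(text):
    """Greedy LZ-End parsing (naive sketch). Returns (copy_length,
    literal) pairs; every copied factor ends at a phrase boundary."""
    n = len(text)
    boundaries = []            # exclusive end positions of earlier phrases
    phrases = []
    i = 0
    while i < n:
        best = 0
        for b in boundaries:
            # longest l with text[b-l:b] == text[i:i+l], leaving a literal
            max_l = min(b, n - 1 - i)
            for l in range(max_l, best, -1):
                if text[b - l:b] == text[i:i + l]:
                    best = l
                    break
        phrases.append((best, text[i + best]))
        i += best + 1
        boundaries.append(i)
    return phrases
```

    On "ababab" this produces four phrases; the optimal parsing, whose computation the paper shows to be NP-hard, can be strictly smaller on other inputs.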

    Comparison and model of compression techniques for smart cloud log file handling

    Compression as a data coding technique has seen approximately 70 years of research and practical innovation. Nowadays, powerful compression tools with good trade-offs exist for a range of file formats, from plain text to rich multimedia. Yet, caught in the dilemma of having to reduce log data sizes as much as possible while keeping as much as possible around for regulatory and compliance reasons, many cloud providers are looking for smarter solutions beyond brute-force compression. In this paper, a comprehensive applied research setting around network and system logs is introduced, comparing text compression ratios and performance. The benchmark encompasses 13 tools and 30 tool-configuration-search combinations. The tool and algorithm relationships, as well as the benchmark results, are modelled in a graph. After discussing the results, the paper reasons about limitations of individual approaches and about suitable combinations of compression with smart, adaptive log file handling. The adaptivity is based on exploiting knowledge of format-specific compression characteristics expressed in the graph, for which a proof-of-concept advisor service is provided.
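    The core measurement of such a benchmark can be sketched with the compressors in the Python standard library (the paper's benchmark covers 13 external tools; the log line and level settings here are illustrative):

```python
import bz2
import gzip
import lzma
import zlib

def benchmark(data: bytes):
    """Compression ratio (original size / compressed size) of each
    standard-library codec on one input."""
    codecs = {
        "gzip": lambda d: gzip.compress(d, compresslevel=9),
        "bzip2": lambda d: bz2.compress(d, compresslevel=9),
        "xz": lambda d: lzma.compress(d, preset=9),
        "zlib": lambda d: zlib.compress(d, level=9),
    }
    return {name: len(data) / len(fn(data)) for name, fn in codecs.items()}

# Highly repetitive, log-like input compresses far better than generic text.
log = b"2024-01-01T00:00:00 INFO request served in 12ms\n" * 1000
ratios = benchmark(log)
```

    Ranking such ratios per log format is exactly the kind of knowledge the paper's graph model and advisor service encode.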

    Autonomous video compression system for environmental monitoring

    [EN] The monitoring of natural environments is becoming a topic of growing interest, as people are more and more concerned about preserving these natural spaces. Monitoring tasks are usually supported by a network infrastructure composed of cameras and network devices that facilitate remote visualization of the monitored environments. This work presents the design, implementation, and testing of an autonomous video compression system for environmental monitoring. The system is based on a server in charge of collecting the videos and analyzing the network constraints. As a function of the measured parameters and the predominant color of the requested video, the system determines the best compression codec for transmitting the video through the network. Additionally, the server runs an algorithm developed in Python and MATLAB that analyzes the red-green-blue (RGB) components of the video and performs the transcoding tasks. The system has been tested with different videos, and the Quality of Service (QoS) and Quality of Experience (QoE) results show that H264 is a good option when the predominant color of a video is black or white, while XVID is one of the codecs that offers interesting results when red, green, or blue predominates. This work has been supported by the Programa para la Formación de Personal Investigador (FPI-2015-S2-884) of the Universitat Politècnica de València. The research leading to these results has received funding from la Caixa Foundation and Triptolemos Foundation. Mateos-Cañas, I.; Sendra, S.; Lloret, J.; Jimenez, J.M. (2017). Autonomous video compression system for environmental monitoring. Network Protocols and Algorithms, 9(1-2), 48-70. https://doi.org/10.5296/npa.v9i1-2.12386
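    The codec-selection rule reported above can be sketched as follows; the frame representation and thresholds are illustrative assumptions, not the paper's implementation (which uses Python and MATLAB on real video):

```python
def mean_rgb(frames):
    """Average R, G, B over all pixels of all frames.
    frames: iterable of iterables of (r, g, b) tuples in 0-255."""
    totals = [0, 0, 0]
    n = 0
    for frame in frames:
        for r, g, b in frame:
            totals[0] += r
            totals[1] += g
            totals[2] += b
            n += 1
    return [t / n for t in totals]

def pick_codec(frames, low=60, high=195):
    """Follow the paper's observation: H264 when the video is mostly
    black or white, XVID when a saturated RGB color dominates.
    The low/high thresholds are hypothetical, not the paper's values."""
    r, g, b = mean_rgb(frames)
    if max(r, g, b) < low or min(r, g, b) > high:
        return "H264"           # near-black or near-white video
    return "XVID"               # a colored channel dominates
```

    In the deployed system, this decision would additionally be conditioned on the measured network constraints before transcoding.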

    Optimal Construction of Hierarchical Overlap Graphs

    Genome assembly is a fundamental problem in bioinformatics: given a set of overlapping substrings of a genome, the aim is to reconstruct the source genome. The classical approaches to solving this problem use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about such overlaps. For genome assembly algorithms, these graphs present a trade-off between the overlap information stored and scalability. The Hierarchical Overlap Graph (HOG) was thus proposed to overcome the limitations of both approaches. For a given set P of n strings, the first algorithm to compute the HOG was given by Cazaux and Rivals [IPL20], requiring O(||P||+n²) time and superlinear space, where ||P|| is the cumulative length of the strings in P. This was improved by Park et al. [SPIRE20] to O(||P|| log n) time and O(||P||) space using segment trees, and further to O(||P||(log n)/(log log n)) time in the word RAM model. Both of these works posed the open problem of computing the HOG in optimal O(||P||) time and space. In this paper, we achieve the desired optimal bounds with a simple algorithm that does not use any complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from O(||P||+n²) time to optimal O(||P||) time, which may be of independent interest.
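    The APSP problem at the paper's core can be stated concretely: for every ordered pair of strings, find the longest suffix of one that is a prefix of the other. A naive quadratic-per-pair sketch (the paper solves its special case in optimal O(||P||) total time):

```python
def all_pairs_suffix_prefix(strings):
    """Naive APSP: for each ordered pair (i, j), the length of the
    longest suffix of strings[i] that is a prefix of strings[j]."""
    def overlap(s, t):
        for l in range(min(len(s), len(t)), 0, -1):
            if s[-l:] == t[:l]:
                return l
        return 0
    return {(i, j): overlap(s, t)
            for i, s in enumerate(strings)
            for j, t in enumerate(strings) if i != j}
```

    These overlap lengths are exactly the quantities an overlap graph stores on its edges and a HOG organizes hierarchically.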

    Gated Linear Networks

    This paper presents a new family of backpropagation-free neural architectures, Gated Linear Networks (GLNs). What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via data-dependent gating in conjunction with online convex optimization. We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable to deep ReLU networks. Furthermore, we demonstrate that the GLN learning mechanism possesses extraordinary resilience to catastrophic forgetting, performing comparably to an MLP with dropout and Elastic Weight Consolidation on standard benchmarks. These desirable theoretical and empirical properties position GLNs as a complementary technique to contemporary offline deep learning methods.
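    A single gated neuron can be sketched as follows (a simplified illustration, not the paper's full construction: halfspace gating on side information selects a weight vector, which geometrically mixes input probabilities and is updated by online gradient descent on the log loss):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

def logit(p):
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

class GatedNeuron:
    """One gated linear neuron: each neuron directly predicts the
    target, with its weights chosen by data-dependent gating."""
    def __init__(self, n_inputs, side_dim, n_gates=2, lr=0.1, seed=0):
        rng = random.Random(seed)
        # fixed random hyperplanes define the gating contexts
        self.hyperplanes = [[rng.gauss(0, 1) for _ in range(side_dim)]
                            for _ in range(n_gates)]
        self.weights = {}          # gating context -> weight vector
        self.n_inputs = n_inputs
        self.lr = lr

    def _context(self, z):
        # sign pattern of the halfspace tests on the side information z
        return tuple(sum(h * x for h, x in zip(hp, z)) >= 0
                     for hp in self.hyperplanes)

    def predict(self, probs, z):
        w = self.weights.setdefault(self._context(z),
                                    [1.0 / self.n_inputs] * self.n_inputs)
        # geometric mixing: sigmoid of a weighted sum of logits
        return sigmoid(sum(wi * logit(p) for wi, p in zip(w, probs)))

    def update(self, probs, z, target):
        c = self._context(z)
        w = self.weights.setdefault(c,
                                    [1.0 / self.n_inputs] * self.n_inputs)
        p = sigmoid(sum(wi * logit(q) for wi, q in zip(w, probs)))
        # log-loss gradient w.r.t. each weight is (p - target) * logit(q_i)
        self.weights[c] = [wi - self.lr * (p - target) * logit(q)
                           for wi, q in zip(w, probs)]
        return p
```

    Each neuron is thus its own online convex learner; stacking layers of such neurons, with the outputs of one layer feeding the next as probabilities, yields the GLN architecture.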