
    Bicriteria data compression

    The advent of massive datasets (and the consequent design of high-performing distributed storage systems) has reignited the interest of the scientific and engineering community in the design of lossless data compressors which achieve an effective compression ratio together with very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed versus compressed-space efficiency. Each of the existing implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one which comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of optimally trading the consumption of these two resources by introducing the Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice versa). This way, the software engineer can set their space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space, within a small additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77 parsings of the input file. A preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.
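    As an illustrative aside (not the paper's actual algorithm), the sketch below builds a toy version of the bi-weighted graph that underlies the Bicriteria LZ77-Parsing problem: vertices are text positions, edges are candidate phrases, and each edge carries a space cost in bits and a decompression-time estimate. The cost models, window size and minimum match length are placeholder assumptions.

```python
# Illustrative sketch of the bi-weighted parsing graph behind the Bicriteria
# LZ77-Parsing problem; the cost models below are toy placeholders.

def build_parsing_graph(text, window=4096, max_len=64):
    """Vertices are text positions 0..n; each edge is a candidate phrase:
    a single literal character or a copy of an earlier substring.
    Every edge carries (space in bits, decompression-time estimate)."""
    n = len(text)
    edges = []  # (src, dst, space_bits, time_cost)
    for i in range(n):
        edges.append((i, i + 1, 9, 1.0))          # literal: flag + byte
        for j in range(max(0, i - window), i):    # bounded search window
            l = 0
            while i + l < n and l < max_len and text[j + l] == text[i + l]:
                l += 1
            if l >= 3:
                # toy model: fixed-size (distance, length) pair; copies from
                # farther back are charged extra time for likely cache misses
                edges.append((i, i + l, 32, 1.0 + (i - j) / window))
    return n, edges
```

    In this view, a bicriteria parsing is a path from vertex 0 to vertex n that minimizes the total space weight subject to a bound on the total time weight (or vice versa).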

    Relative Suffix Trees

    Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.

    Brotli: A General-Purpose Data Compressor

    Brotli is an open source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub and its data format was submitted as RFC 7932 in July 2016. Brotli is based on the Lempel-Ziv compression scheme and is planned as a generic replacement for Gzip and ZLib. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This article is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software engineering problems that deserve further attention from the scientific community.
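    For readers who want to reproduce an informal density/speed comparison of the kind discussed in the article, the snippet below uses the Python bindings of Brotli and zlib; the sample file name and the quality settings are arbitrary choices, and the measurement is deliberately crude.

```python
# Rough comparison of compressed size and decompression time for zlib and
# Brotli. Requires the "brotli" package (pip install brotli); "sample.txt"
# is a placeholder for any test file.
import time
import zlib

import brotli

with open("sample.txt", "rb") as f:
    data = f.read()

for name, compress, decompress in [
    ("zlib -9",    lambda d: zlib.compress(d, 9),            zlib.decompress),
    ("brotli q11", lambda d: brotli.compress(d, quality=11),  brotli.decompress),
]:
    blob = compress(data)
    t0 = time.perf_counter()
    assert decompress(blob) == data            # round-trip check
    elapsed = time.perf_counter() - t0
    print(f"{name}: {len(blob)} bytes ({len(blob) / len(data):.2%} of input), "
          f"decompressed in {elapsed * 1e3:.1f} ms")
```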

    Combined use of x-ray fluorescence microscopy, phase contrast imaging for high resolution quantitative iron mapping in inflamed cells

    X-ray fluorescence microscopy (XRFM) is a powerful technique to detect and localize elements in cells. To derive information useful for biology and medicine, it is essential not only to localize, but also to quantitatively map the element concentration. Here we applied quantitative XRFM to iron in phagocytic cells. Iron, a primary component of living cells, can become toxic when present in excess. In human fluids, free iron is maintained at a 10^-18 M concentration thanks to iron-binding proteins such as lactoferrin (Lf). Iron homeostasis, involving the physiological ratio of iron between tissues/secretions and blood, is strictly regulated by ferroportin, the sole protein able to export iron from cells to blood. Inflammatory processes induced by lipopolysaccharide (LPS) or bacterial pathogens inhibit ferroportin synthesis in epithelial and phagocytic cells, thus hindering iron export and increasing intracellular iron and bacterial multiplication. In this respect, Lf is emerging as an important regulator of both iron and inflammatory homeostasis. Here we studied phagocytic cells inflamed by bacterial LPS and either untreated or treated with milk-derived bovine Lf. Quantitative mapping of iron concentration and mass fraction at high spatial resolution is obtained by combining X-ray fluorescence microscopy, atomic force microscopy and synchrotron phase contrast imaging.

    On the use of optimization techniques for designing novel data compression and indexing schemes

    The last few years have seen an exponential increase in the amount of data that must be stored, accessed and analyzed, driven by many disparate fields such as big data analytics, genomic technologies and even high-energy physics. The possibility offered to users by social media networks of easily creating and publishing content gave rise to platforms managing hundreds of petabytes, while the appearance of sequencers capable of producing terabytes of data at an inexpensive price, made possible by advancements in DNA sequencing technology, opened the road to fields such as pan-genomic analysis, where common motifs have to be found in hundreds of genomes that must first be sequenced and indexed. Many of these developments, which operate on the data in an on-line fashion, have strict performance requirements. For example, web indexes must be capable of retrieving any indexed webpage in less than a millisecond on average in order to meet operational standards. This represents an incentive to store as much data as possible in main memory, instead of secondary storage like hard disks or solid-state disks. In fact, since accessing RAM has a latency of around 100 nanoseconds, while accessing the disk has a latency of a few milliseconds, even allowing just 1% of the memory references to be read from disk implies a slowdown by a factor of about 100 (a short back-of-the-envelope calculation is given after this overview).
    The most effective way of fitting more data in memory is to use data compression techniques. However, because of the on-line fashion in which these applications operate, such techniques must allow operating over the data with an efficiency comparable to that of storing the data uncompressed. The kind of techniques that can deliver both good compression ratios and fast data access depends on the way the data is accessed and on the kind of operations that must be performed on it. Applications can be grouped into two broad categories, depending on the nature of these operations:
    -) applications that organize data into files that must be accessed in their entirety. An example of this pattern is high-performing distributed storage systems like Google's BigTable, a distributed key-value database where each value is composed of chunks of 64 MiB of data. Applications in this scenario, also known as "compress once, decompress many times", need a good trade-off between compression ratio and decompression speed.
    -) applications that need to perform sophisticated queries on their dataset. Examples of this class are web indexes, where a search query needs to fetch only those documents containing a user-supplied list of words, or DNA aligners, where portions of the text that match a given pattern must be found. In this kind of application only a small fraction of the data is accessed, so the compressed representation should allow some form of direct access without requiring decompression of the whole text, which is costly.
    Theoretical solutions for these needs abound in the literature. For "compress once, decompress many times" scenarios there are compressors, based either on the Lempel-Ziv parsing scheme or on the Burrows-Wheeler Transform, that are asymptotically optimal both in time and space, while for the second category compressed full-text indexes such as the FM-index allow powerful queries on the compressed representation with almost optimal time complexities.
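    As referenced above, the memory-versus-disk argument can be made concrete with a short back-of-the-envelope calculation; 1 ms is taken here as a representative disk latency within the "few milliseconds" range quoted in the text.

```python
# Expected memory-access time when 1% of references miss RAM and go to disk.
ram_ns, disk_ns = 100, 1_000_000        # ~100 ns RAM, ~1 ms disk (assumed)
miss_rate = 0.01                        # 1% of references served from disk

expected_ns = (1 - miss_rate) * ram_ns + miss_rate * disk_ns
print(expected_ns)                      # 10099.0 ns on average
print(expected_ns / ram_ns)             # ~101x slower than pure-RAM access
```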
    However, new applications pose plenty of new challenges, such as the need for specific time/space trade-offs that are not provided by existing solutions, or the necessity of new, compact storage systems that exploit the peculiarities of new datasets to support fast queries on them. In this thesis, our contributions focus on developing tools and techniques that can be helpful in addressing these new challenges.
    Efficient and usable space-optimal compressor. Even though the LZ77 algorithm described by Lempel and Ziv is asymptotically optimal in a strong information-theoretic sense, it does not necessarily yield the lowest possible compression ratio achievable for every possible string. In fact, an active line of research, both in industry and in academia, has focused on improving the compression ratio attainable by an LZ77 scheme on each individual string. Ferragina, Nitto and Venturini introduced the bit-optimal LZ77 parsing problem, that is, the problem of constructing the most succinct LZ77 compression of any given text, and illustrated an efficient construction algorithm. Their algorithm assumes that universal codes are used for compressing the LZ77 phrases in the parsing. The algorithm illustrated in their work, albeit theoretically efficient when specific coders are used, is not practical because it involves sophisticated transformations on cache-unfriendly data structures, so it is difficult to implement and likely slow in practice. In this chapter we show a practical, memory-friendly algorithm that matches the time and space complexity of the original solution.
    Bicriteria data compressor. In industry, many data compressors have been developed to trade, in a very specific way, compression ratio for decompression speed. Unfortunately, these approaches are ad hoc and thus not readily applicable to other application scenarios, a situation that has led to the development of myriads of different data compressors, each targeting a different performance profile. In this chapter we address this issue by introducing the Bicriteria Data Compression problem, which asks for the most succinct LZ77 compression that can be decompressed within a given time budget, or vice versa. Similarly to the bit-optimal LZ77 parsing problem, we assume that phrases are compressed using universal codes. The problem is modeled as a Weight-Constrained Shortest Path Problem on a directed, acyclic graph. We then show an algorithm that solves the problem efficiently by exploiting some peculiarities of that graph. Nicely, the solution reduces the problem to the resolution of a small number of instances of the bit-optimal LZ77 problem, which in turn implies the existence of an efficient and practical construction algorithm.
    An efficient, engineered data compressor. In this chapter we illustrate bczip, an engineered implementation of the efficient bit-optimal and bicriteria data compressors introduced in the previous chapters. We illustrate some algorithmic improvements that aim at making the algorithm even more efficient in practice. We also show that conventional means of providing decompression time/compression ratio trade-offs in LZ77 compressors are not consistent and sometimes even detrimental, a result which further validates our approach. The benchmarks show that the proposed compressor is extremely competitive with the state of the art when compression ratio and decompression speed are considered together.
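    To make the shortest-path formulation concrete, here is a minimal, self-contained dynamic-programming sketch that computes, for every text position, the fewest bits needed to reach it using literals or copies; the quadratic phrase enumeration and the fixed cost model are simplifications that the bit-optimal and bicriteria algorithms described above are precisely designed to avoid.

```python
# Toy "parsing as shortest path": best[i] is the fewest bits needed to encode
# text[:i] with a fixed-cost model (9 bits per literal, 32 bits per copy).
# The brute-force phrase enumeration is a placeholder for the efficient
# machinery discussed in the thesis.

def cheapest_parse_bits(text, window=4096, max_len=64):
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)
    best[0] = 0
    for i in range(n):                        # vertices in topological order
        if best[i] == INF:
            continue
        best[i + 1] = min(best[i + 1], best[i] + 9)         # literal edge
        for j in range(max(0, i - window), i):              # candidate copies
            l = 0
            while i + l < n and l < max_len and text[j + l] == text[i + l]:
                l += 1
            if l >= 3:
                best[i + l] = min(best[i + l], best[i] + 32)
    return best[n]

print(cheapest_parse_bits(b"abracadabra abracadabra"))
```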
    Succinct index for relative data structures. In biological applications there is a need to index a great number of documents that are very similar to one another. Because of this, a new line of research has focused on building relative compressed indexes, that is, compressed indexes that are expressed relative to the index of a similar document in order to save space. These approaches exploit the similarities of the data structures underlying these indexes to lower their total space consumption. For example, the Relative FM-Index exploits the fact that the BWTs of two similar documents are similar, and thus expresses a BWT as a difference from that of the reference. In this chapter we propose a new relative Lempel-Ziv representation that can compactly represent all differences among similar documents, and we show a very efficient implementation supporting fast random access to the (relatively) compressed input text. Being a general solution, this approach can help in designing new relative data structures for similar settings.
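    The relative Lempel-Ziv idea mentioned above can be sketched as follows: the target document is greedily parsed into phrases that are copies of substrings of the reference (falling back to literals), and random access only needs the phrase list plus the phrases' starting positions in the target. This is an illustrative simplification, not the thesis' data structure; a real implementation would replace the naive matching with an index over the reference.

```python
# Toy relative Lempel-Ziv: parse `target` as copies of substrings of
# `reference` plus literals, and decode any single position directly.
import bisect

def rlz_parse(reference, target):
    phrases, starts, i = [], [], 0       # phrase = (ref_pos, length) or (char, 0)
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):  # naive matching, for illustration only
            l = 0
            while (p + l < len(reference) and i + l < len(target)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        starts.append(i)
        if best_len >= 2:
            phrases.append((best_pos, best_len))
            i += best_len
        else:
            phrases.append((target[i], 0))
            i += 1
    return phrases, starts

def rlz_access(reference, phrases, starts, pos):
    k = bisect.bisect_right(starts, pos) - 1         # phrase covering pos
    head, length = phrases[k]
    return reference[head + (pos - starts[k])] if length else head

ref, tgt = "ACGTACGTTTGA", "ACGTACGATTGACC"
phrases, starts = rlz_parse(ref, tgt)
assert all(rlz_access(ref, phrases, starts, p) == tgt[p] for p in range(len(tgt)))
```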

    Multi-Objective Optimization Design for LZSS

    In this thesis we explore the idea of controlling the LZSS compression ratio/decompression speed trade-off in a principled way. Our approach rests on two ideas. The first is to model an LZSS parsing of a text T as a path on a (weighted) graph G(T), which has a vertex for each character in T and an edge for each phrase in the dictionary. The second is to model explicitly the amount of compression space and decompression time taken up by a parsing through resource models. Neither idea is new: the notion of transposing LZ77 parsings to paths in a suitably defined graph G was originally illustrated by Schuegraf, while Ferragina et al. labeled each edge of G with the cost in bits of the associated codeword. In this thesis, we further extend G by labeling each edge with the encoding size in bits and the decoding time of the codeword, as given by the resource models. This model allows us to define, in a precise way, the time-constrained LZSS parsing problem, namely the problem of determining the parsing which minimizes the compressed size given a bound on the decompression time, as a constrained single-source shortest path problem on G. The proposed strategy to attack the problem is a heuristic based on the computation of the Lagrangian dual, i.e., the computation of the “best” Lagrangian relaxation of the time constraint. In this way we reduce the problem to computing, several times, the bit-optimal LZSS parsing, so we can take advantage of its efficient resolution algorithms. Next, we apply these ideas in the implementation of a compressor which employs the time-constrained LZSS strategy. The treatment includes the description of a fast encoder and the derivation of an accurate time model for the target processor, a Core 2 Duo P8600. The experimental results show that the idea is promising. In fact, the compressor exhibits remarkable performance in its capacity to control the time/space trade-off, mainly due to the high accuracy of the time model and the excellent performance of the Lagrangian dual heuristic, whose solutions are very close to the optimal one. Moreover, the results are impressive even on an absolute scale, since the compressor exhibits equal-or-better compression ratios than gzip and decompression times comparable with those of Snappy, the state of the art in fast LZSS compressors.
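    A minimal sketch of the Lagrangian-dual idea described above: the time constraint is moved into the objective with a multiplier, every fixed value of the multiplier turns the problem into an ordinary single-criterion shortest-path computation over the bi-weighted graph (i.e., a bit-optimal-style parse with combined edge weights), and the multiplier is adjusted by bisection until the resulting parse fits the time budget. The explicit edge list and the bisection bounds are placeholder assumptions, not the thesis' implementation.

```python
# Lagrangian-dual heuristic over an explicit DAG whose edges carry
# (space_bits, time_cost) weights; vertices are 0..n and all edges go forward.

def shortest_path_dag(n, edges, weight):
    """Single-criterion shortest path; returns the (space, time) totals of
    the path chosen under the combined edge weight `weight(space, time)`."""
    INF = float("inf")
    adj = [[] for _ in range(n + 1)]
    for u, v, s, t in edges:
        adj[u].append((v, s, t))
    cost = [INF] * (n + 1)
    cost[0] = 0.0
    totals = [(0.0, 0.0)] + [(INF, INF)] * n
    for u in range(n):
        if cost[u] == INF:
            continue
        for v, s, t in adj[u]:
            c = cost[u] + weight(s, t)
            if c < cost[v]:
                cost[v] = c
                totals[v] = (totals[u][0] + s, totals[u][1] + t)
    return totals[n]

def lagrangian_parse(n, edges, time_budget, iters=30):
    """Bisection on the multiplier: larger values penalize time more."""
    space, time_used = shortest_path_dag(n, edges, lambda s, t: s)
    if time_used <= time_budget:          # unconstrained optimum already fits
        return space, time_used
    lo, hi = 0.0, 1e6                     # assumed bracket for the multiplier
    for _ in range(iters):
        lam = (lo + hi) / 2
        space, time_used = shortest_path_dag(n, edges, lambda s, t: s + lam * t)
        if time_used > time_budget:
            lo = lam                      # infeasible: penalize time more
        else:
            hi = lam                      # feasible: try to save more space
    return shortest_path_dag(n, edges, lambda s, t: s + hi * t)
```

    Each bisection step costs one single-criterion parse, which is why the heuristic can reuse the efficient bit-optimal parsing algorithms as a black box.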

    Bicriteria Data Compression

    Since the seminal work by Shannon, theoreticians have focused on designing compressors targeted at minimizing the output size without sacrificing much of the compression/decompression efficiency. On the other hand, software engineers have deployed several heuristics to implement compressors aimed at trading compressed space versus compression/decompression efficiency in order to match their application needs. In this paper we fill this gap by introducing the bicriteria data-compression problem, which seeks to determine the shortest compressed file that can be decompressed within a given time bound. Then, inspired by modern data-storage applications, we instantiate the problem onto the family of Lempel-Ziv-based compressors (such as Snappy and LZ4) and solve it by combining, in a novel and efficient way, optimization techniques, string-matching data structures, and shortest path algorithms over properly (bi-)weighted graphs derived from the data-compression problem at hand. An extensive set of experiments complements our theoretical achievements by showing that the proposed algorithmic solution is very competitive with respect to state-of-the-art highly engineered compressors.