30 research outputs found

    Efficient compression of large repetitive strings

    When it comes to managing large volumes of data, general-purpose compressors such as gzip are ubiquitous. They are fast, practical and available on every modern platform, from standard desktops to mobile devices. These tools exploit local redundancy in a text using a fixed-size sliding window. This window is usually very small relative to the text, although in principle it can be as large as available memory. The window acts as a dictionary, and compression is achieved by replacing substrings with pointers to previous occurrences found in the dictionary. This type of algorithm becomes problematic when dealing with collections that are larger than physical memory, as it fails to capture any non-local redundancy, that is, repetition that occurs outside of its search window. With rapid growth in the already enormous amount of data we store and process, there is a pressing need to improve compression effectiveness, reducing both storage requirements and decompression costs. Nevertheless, many systems still apply general-purpose compression tools to large, highly repetitive data collections. In this thesis we focus on addressing this issue. We explore compression in a variety of domains where large volumes of data need to be stored and accessed, and where general-purpose compression tools are the norm. First we discuss our work on web corpus compression; then we describe the implementation of a practical index for repetitive texts that gives strong theoretical bounds in terms of size and access; finally, we discuss our work on compression of high-throughput sequencing reads. We show that in all cases our new methods improve on current techniques in both run-time and compression effectiveness, and provide important functionality such as fast decoding and random access.
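
    The sliding-window scheme described above can be sketched in a few lines. This is a toy illustration of the idea, not gzip's actual implementation: the window size, the naive match search, and the (offset, length, literal) phrase shape are illustrative choices.

```python
# Toy LZ77-style parser: each phrase is a pointer (offset, length) into
# a fixed-size sliding window, plus one literal character. The search is
# naive for clarity; real compressors use hash chains or similar.

def lz77_parse(text, window=16):
    i, phrases = 0, []
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # matches may overlap the current position, as in classic LZ77
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        nxt = text[i + best_len]          # literal that ends the phrase
        phrases.append((best_off, best_len, nxt))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    out = []
    for off, length, nxt in phrases:
        for _ in range(length):
            out.append(out[-off])         # byte-by-byte copy handles overlap
        out.append(nxt)
    return "".join(out)
```

    The key limitation the thesis targets is visible here: the parser can only point back `window` characters, so a repeat of the text that lies further away than the window is re-encoded from scratch.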

    RLZAP: Relative Lempel-Ziv with Adaptive Pointers

    Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression, because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation, with comparable random-access times.
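
    The greedy reference parse with mismatch characters (the Deorowicz and Grabowski variant described above) can be sketched as follows. This is a minimal illustration: the linear scan with `str.find` stands in for the suffix-array or FM-index lookup a practical implementation would use, and it omits the relative-pointer encoding that RLZAP adds.

```python
# Minimal RLZ parse against a reference: each phrase is the longest
# prefix of the remaining input occurring anywhere in the reference,
# plus one literal character covering the mismatch.

def rlz_parse(reference, text):
    i, phrases = 0, []
    n = len(text)
    while i < n:
        best_pos, best_len = 0, 0
        # try match lengths from longest to shortest (naive, for clarity)
        for length in range(min(n - i - 1, len(reference)), 0, -1):
            pos = reference.find(text[i:i + length])
            if pos != -1:
                best_pos, best_len = pos, length
                break
        nxt = text[i + best_len]          # mismatch (or final) character
        phrases.append((best_pos, best_len, nxt))
        i += best_len + 1
    return phrases

def rlz_decode(reference, phrases):
    return "".join(reference[p:p + l] + c for p, l, c in phrases)
```

    On a genome differing from the reference only by single-nucleotide substitutions, each substitution costs exactly one phrase boundary, which is why the mismatch-character variant compresses such data well.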

    Block Graphs in Practice

    Motivated by the rapidly increasing size of genomic databases, code repositories and versioned texts, several compression schemes have been proposed that work well on highly repetitive strings and also support fast random access: e.g., LZ-End, RLZ, GDC, augmented SLPs, and block graphs. Block graphs have good worst-case bounds, but it has been an open question whether they are practical. We describe an implementation of block graphs that, for several standard datasets, provides better compression and faster random access than competing schemes.

    Relative Lempel-Ziv Compression of Suffix Arrays

    We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, facilitating pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA '18), still provides significant compression, and allows pattern location queries to be answered more than two orders of magnitude faster in practice.
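
    The differential-encoding step mentioned above can be sketched as below: store the first suffix-array entry and then successive differences. The sketch is an illustration of the general idea, assuming a naive suffix-array construction; the point is that for repetitive texts the difference runs themselves repeat, which is what makes a subsequent RLZ parse of the difference array effective.

```python
# Differential encoding of a suffix array: SA[0] followed by the
# differences SA[i] - SA[i-1]. Naive SA construction, for illustration.

def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])

def diff_encode(sa):
    return [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]

def diff_decode(diffs):
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out
```

    Decoding an arbitrary interval only needs the nearest stored sample plus a prefix sum over the differences, which is what keeps interval access fast.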

    Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

    Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades off compression effectiveness against document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
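
    The block-based baseline described above can be sketched as follows, with zlib standing in for gzip. The grouping and the separator character are illustrative choices; the point is the trade-off the text mentions: larger blocks compress better but make each random access pay for a bigger decompression.

```python
import zlib

# Baseline: group documents into fixed-size blocks, compress each block
# independently, and answer a random access by decompressing only the
# block that holds the requested document.

def compress_blocks(docs, block_size=4):
    blocks = []
    for b in range(0, len(docs), block_size):
        group = docs[b:b + block_size]
        blocks.append(zlib.compress("\x00".join(group).encode()))
    return blocks

def get_doc(blocks, block_size, doc_id):
    data = zlib.decompress(blocks[doc_id // block_size]).decode()
    return data.split("\x00")[doc_id % block_size]
```

    The paper's approach replaces the per-block adaptive dictionary with a single shared sample of the collection, so each document can be decoded independently without decompressing any neighbours.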

    Prioritising catchment management projects to improve marine water quality

    Runoff from human land-uses is one of the most significant threats to some coastal marine environments. Initiatives to reduce that runoff usually set runoff reduction targets but do not give guidance on how to prioritise the different options that exist to achieve them. This paper demonstrates an easy-to-interpret economic framework to prioritise investment for conservation projects that aim to reduce pollution of marine ecosystems caused by runoff from agricultural land-uses. We demonstrate how to apply this framework using data on project cost, benefit and feasibility with a subset of projects that have been funded to reduce runoff from subcatchments adjacent to the Great Barrier Reef. Our analysis provides a graphical overview of the cost-effectiveness of the investment options, enables transparent planning for different budgets, assesses the existence of trends in the cost-effectiveness of different categories, and can test whether the results are robust under uncertainty in one or more of the parameters. The framework provided solutions that were up to 4 times more efficient than when omitting information on cost or benefit. The presented framework can be used as a benchmark for evaluating results from a range of prioritisation processes against the best possible conservation outcomes.
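
    A ranking over cost, benefit and feasibility of the kind described above can be illustrated as below. The scoring rule (benefit discounted by feasibility, per dollar of cost) and the field names are assumptions made for the sketch, not the paper's exact formulation.

```python
# Illustrative cost-effectiveness prioritisation: score each project by
# expected benefit per dollar, then fund greedily within the budget.
# The scoring rule and field names are assumptions for this sketch.

def prioritise(projects, budget):
    ranked = sorted(projects,
                    key=lambda p: p["benefit"] * p["feasibility"] / p["cost"],
                    reverse=True)
    funded, spent = [], 0.0
    for p in ranked:
        if spent + p["cost"] <= budget:   # fund while the budget allows
            funded.append(p["name"])
            spent += p["cost"]
    return funded, spent
```

    Ranking by benefit per dollar rather than by benefit or cost alone is what the abstract's efficiency comparison is about: omitting either term can fund expensive, low-return projects first.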

    Dry sliding tribological behaviour of cast irons for internal combustion engine cylinder liners

    SIGLE record CNRS TD 15304 / INIST-CNRS (Institut de l'Information Scientifique et Technique), France

    Physical characterisation of high amylose maize starch and acylated high amylose maize starches

    The particle size, water sorption properties and molecular mobility of high amylose maize starch (HAMS) and high amylose maize starch acylated with acetate (HAMSA), propionate (HAMSP) and butyrate (HAMSB) were investigated. Acylation increased the mean particle size (D(4,3)) and lowered the specific gravity (G) of the starch granules, with an inverse relationship between the length of the fatty acid chain and particle size. Acylation of HAMS with fatty acids lowered the monolayer moisture content, with the trend being HAMSB<HAMSA<HAMSP<HAMS, showing that the decrease is affected by factors other than the length of the fatty acid chain. Measurement of molecular mobility of the starch granules by NMR spectroscopy with Carr-Purcell-Meiboom-Gill (CPMG) experiments showed that T2 long was reduced in acylated starches and that drying and storage of the starch granules further reduced T2 long. Analysis of the Free Induction Decay (FID), focussing on the short components of T2 (correlated to the solid matrix), indicated that drying and subsequent storage resulted in alterations of starch at 0.33 a(w) and that these changes were reduced with acylation. In vitro enzymatic digestibility of heated starch dispersions by bacterial α-amylase was increased by acylation (HAMS<HAMSB<HAMSP≤HAMSA), showing that the trend was not related to the length of the fatty acid chain. Digestibility was enhanced with an increase in particle size, or decrease in G, and inversely proportional to the total T2 signal. It is suggested that both external surface area and an internal network of pores and channels collectively influence the digestibility of starch.