
    Reference Based Genome Compression

    DNA sequencing technology has advanced to the point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, whereas Gzip compresses it to 834.8 MB.
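    A minimal sketch of the reference-based idea described above, under assumptions of my own: a toy k-mer seed index stands in for the paper's mapping procedure, and zlib stands in for its entropy coder.

```python
import zlib
from collections import defaultdict

K = 16  # seed length for the toy reference index (an assumption, not from the paper)

def build_index(reference: str) -> dict:
    """Index every length-K substring of the reference by its starting positions."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def map_to_reference(target: str, reference: str, index: dict) -> list:
    """Greedily express the target as (ref_pos, length) copies plus literal runs."""
    ops, literal, i = [], [], 0
    while i < len(target):
        positions = index.get(target[i:i + K], [])
        if positions:
            p, length = positions[0], K
            # extend the seed match as far as the two genomes keep agreeing
            while (p + length < len(reference) and i + length < len(target)
                   and reference[p + length] == target[i + length]):
                length += 1
            if literal:
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("C", p, length))
            i += length
        else:
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops

def compress(ops: list) -> bytes:
    """Serialize the mapping and squeeze it with zlib, standing in for the entropy coder."""
    text = "\n".join(",".join(map(str, op)) for op in ops)
    return zlib.compress(text.encode(), 9)

reference = "ACGT" * 10_000
target = reference[:20_000] + "TTTTGGGG" + reference[20_000:]
packed = compress(map_to_reference(target, reference, build_index(reference)))
print(len(target), "->", len(packed), "bytes")
```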

    Enhancing Text Compression Method Using Information Source Indexing

    Text compression methods in which the original text is mapped directly into the binary domain are attractive for compressing English text files. This paper proposes an intermediate mapping scheme in which the original English text is first transformed to the decimal domain and then to the binary domain. Each two-decimal-digit value in the resulting intermediate decimal file represents the index of the location of the corresponding character in the original text. If an already indexed character is seen again, it is replaced by its previously assigned decimal index. The decimal file is converted into the binary domain by assigning each decimal digit a 4-bit weighted code, akin to a BCD code, according to its frequency of occurrence. The assigned codes aim at generating an equivalent binary file whose entropy is as close as possible to that of the original. Thereafter, any conventional compression algorithm, such as the Lempel-Ziv algorithms, can be applied to the generated binary file. The obtained compression ratios outperform those obtained when applying the same compression algorithm to binary files generated either by direct mapping of the original text or by mapping the decimal file using Binary Coded Decimal (BCD) codes. Keywords: lossless data compression; source encoding; LZW coding; Hamming weights; compression ratio
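    A rough sketch of the intermediate mapping described above, with two simplifications of my own: the digit-to-4-bit assignment is plain BCD rather than the paper's frequency-weighted codes, and zlib stands in for LZW as the final conventional compressor.

```python
import zlib

def to_decimal_indices(text: str) -> str:
    """Replace each character by a two-digit index assigned at its first occurrence;
    repeated characters reuse their previously assigned index."""
    seen, digits = {}, []
    for ch in text:
        if ch not in seen:
            seen[ch] = len(seen)            # assumes at most 100 distinct characters
        digits.append(f"{seen[ch]:02d}")
    return "".join(digits)

def pack_digits(digits: str) -> bytes:
    """Give each decimal digit a 4-bit code and pack two digits per byte.
    Plain BCD weights are used here; the paper reorders the 4-bit codes by
    digit frequency, which this sketch omits."""
    if len(digits) % 2:
        digits += "0"                       # pad to a whole number of bytes
    return bytes((int(a) << 4) | int(b) for a, b in zip(digits[::2], digits[1::2]))

text = "this is a small sample of plain english text " * 40
binary = pack_digits(to_decimal_indices(text))
print("direct:", len(zlib.compress(text.encode(), 9)),
      "via decimal indexing:", len(zlib.compress(binary, 9)))
```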

    Less redundant codes for variable size dictionaries

    We report on a family of variable-length codes with less redundancy than the flat code used in most variable-size dictionary-based compression methods. The length of the codes belonging to this family is still bounded above by ⌈log₂ |D|⌉, where |D| denotes the dictionary size. We describe three of these codes, namely the balanced code, the phase-in-binary (PB) code, and the depth-span (DS) code. As the name implies, the balanced code is constructed from a height-balanced tree, so it has the shortest average codeword length. The coding tree for the PB code has the interesting property that it is made of full binary phases, so the code can be computed efficiently using simple binary shift operations. The DS coding tree is maintained in such a way that the coder always finds the longest extendable codeword and extends it until it reaches the maximum length; it is optimal with respect to the code-length contrast. The PB and balanced codes achieve similar improvements, around 3% to 7%, which is close to the relative redundancy of the flat code. The DS code is particularly good at dealing with files containing a large amount of redundancy, such as a running sequence of a single symbol. We also conducted an empirical study of the codeword distribution in the LZW dictionary and propose a scheme called dynamic block shifting (DBS) to further improve the codes' performance. Experiments suggest that DBS is helpful in compressing random sequences. From an application point of view, the PB code with DBS is recommended for general practical use.
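    The balanced and DS codes, and the dynamic block shifting scheme, involve tree maintenance that the abstract does not detail, so the sketch below only shows a textbook phased-in binary code of the kind referred to as the PB code: for a dictionary of n entries it assigns either ⌈log₂ n⌉ or ⌈log₂ n⌉ − 1 bits per index instead of a flat ⌈log₂ n⌉ bits.

```python
def phase_in_code(index: int, n: int) -> str:
    """Phased-in binary codeword (as a bit string) for dictionary entry `index`
    out of n entries: the first 2**(m+1) - n entries get m bits, the rest m+1,
    where m = floor(log2 n). The resulting set of codewords is prefix-free."""
    m = n.bit_length() - 1        # m = floor(log2 n)
    short = 2 ** (m + 1) - n      # how many entries receive the short (m-bit) code
    if index < short:
        return format(index, f"0{m}b")
    return format(2 * short + (index - short), f"0{m + 1}b")

# A dictionary of 5 entries: three 2-bit codes and two 3-bit codes,
# instead of a 3-bit flat code for every entry.
for i in range(5):
    print(i, phase_in_code(i, 5))
```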

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionality. As this new technology permeated applications such as bioinformatics, string collections experienced a growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. Fortunately, it turns out that many of these rapidly growing string collections are highly repetitive, so their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to exploit it properly. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the basis of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical terms. We conclude with the current challenges in this fascinating field.
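    A toy illustration of the point about statistical compression being blind to repetitiveness (my own example, not one of the surveyed indexes): an order-0 model charges the same bits per symbol no matter how many times a document is repeated, while the number of phrases in an LZ-style parse barely grows.

```python
import math
from collections import Counter

def order0_bits(s: str) -> float:
    """Bits used by an order-0 statistical model: scales linearly with length,
    regardless of how repetitive the text is."""
    counts = Counter(s)
    return sum(-c * math.log2(c / len(s)) for c in counts.values())

def lz_phrases(s: str) -> int:
    """Phrases in a greedy LZ-style parse that may copy any earlier substring;
    grows very slowly on repetitive text. (A quadratic toy parser, not an index.)"""
    i, phrases = 0, 0
    while i < len(s):
        length = 1
        # grow the phrase while s[i:i+length] already occurs starting before i
        while i + length <= len(s) and s.find(s[i:i + length], 0, i + length - 1) != -1:
            length += 1
        i += length
        phrases += 1
    return phrases

doc = "to be or not to be, that is the question. "
collection = doc * 200                      # a highly repetitive "collection"
print(round(order0_bits(doc)), round(order0_bits(collection)))   # 200x larger
print(lz_phrases(doc), lz_phrases(collection))                   # barely grows
```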

    Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit

    The problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we are primarily concerned with sequences as our objects and with various string-comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of solutions whose time complexity depends both on the chosen parsing method and on the string-comparison metric used for the alignment.
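    For context, a minimal version of the classical quadratic alignment DP that this line of work accelerates; the repetition-aware parsing and table re-use described in the abstract are not reproduced here.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical O(|S|*|T|) dynamic program for the edit-distance metric,
    computed row by row. Roughly, repetition-aware methods re-use work across
    repeated blocks of the inputs instead of recomputing it (not shown)."""
    prev = list(range(len(t) + 1))            # distances for the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # match or substitute
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
```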

    Practical Aspects of Implementing a Suffix Array-based Lempel-Ziv Data Compressor

    Lempel-Ziv factorization of a string is a fundamental tool used by myriad data compressors. Despite its optimality with respect to the number of produced factors, it is rarely used without modification because of its computational cost. In recent years, Lempel-Ziv factorization has been a busy research subject, and we have witnessed the state of the art being completely changed. In this thesis, I explore the properties of the latest suffix array-based Lempel-Ziv factorization algorithms while experimenting with turning them into an efficient general-purpose data compressor. The setting of this thesis is purely exploratory, guided by reliable and repeatable benchmarking. I explore all aspects of a suffix array-based Lempel-Ziv data compressor. I describe how the chosen factorization method affects the development of the encoding and other components of a functional data compressor. I show how the chosen factorization technique, together with the capabilities of modern hardware, allows determining the length of the longest common prefix of two strings over 80% faster than the baseline approach. I also present a novel approach to optimizing the encoding cost of the Lempel-Ziv factorization of a string, i.e., bit-optimality, using a dynamic programming approach to the single-source shortest path problem. I observed that, in its current state, suffix array construction is a major computational bottleneck in suffix array-based Lempel-Ziv factorization. Additionally, using a suffix array to produce a Lempel-Ziv factorization leads to optimality in the number of factors, which does not necessarily correspond to bit-optimality. Finally, a comparison with common third-party data compressors revealed that relying exclusively on Lempel-Ziv factorization prevents reaching the highest compression efficiency. For these reasons, I conclude that current suffix array-based Lempel-Ziv factorization is unsuitable for general-purpose data compression.
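    A small sketch of the shortest-path view of bit-optimal parsing mentioned above, under an invented fixed-cost model and with naive quadratic matching in place of the thesis's suffix-array machinery.

```python
def bit_optimal_parse_cost(s: str, literal_bits: int = 9, factor_bits: int = 32) -> float:
    """Minimum encoding cost of an LZ-style parse of s, as a shortest path on the
    parse DAG: node i is a text position, an edge i -> i+1 is a literal, and an
    edge i -> i+len is a factor s[i:i+len] that also occurs starting before i.
    The fixed per-edge bit costs are an invented toy model."""
    n, INF = len(s), float("inf")
    dist = [0.0] + [INF] * n
    for i in range(n):
        if dist[i] == INF:
            continue
        dist[i + 1] = min(dist[i + 1], dist[i] + literal_bits)      # literal edge
        length = 2
        # factor edges: every prefix of s[i:] that already occurs before position i
        while i + length <= n and s.find(s[i:i + length], 0, i + length - 1) != -1:
            dist[i + length] = min(dist[i + length], dist[i] + factor_bits)
            length += 1
    return dist[n]

text = "abracadabra " * 20
print(bit_optimal_parse_cost(text), "bits for", len(text), "characters")
```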