2,243 research outputs found

    Huffman-based Code Compression Techniques for Embedded Systems


    Implementation of a dna compression algorithm using dataflow computing

    The number of DNA sequence databases has grown considerably in recent years, and the space required to store the sequences is increasing faster than the space available to hold them. This raises the cost of storing DNA sequences as well as reads, which are fragments of the whole sequence, and has led to the use of compression algorithms for storing DNA files. The main objective of this project is to improve the efficiency of DNA sequence compression, since the process is computationally demanding. An FPGA with a dataflow architecture was used to develop the project, with the aim of exploiting the parallelism available in the chosen algorithm. The compression method processes sequence reads with a fixed number of mutations per read; tests were run with 4, 8, 12 and 16 mutations per read using an architecture that processes up to 160 reads in a single clock tick. Experimental results show that, even with a small number of processing units, performance increases substantially with the DFE architecture; the only disadvantage is the storage/reading time. Keywords: Compression, Dataflow Engine (DFE), FPGA, CPU, DNA, Maxeler
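    A fixed mutation budget per read makes each compressed record a fixed size, which is what enables a hardware pipeline to process many reads per tick. The following is a minimal software sketch of that encoding idea (not the authors' implementation; `encode_read`/`decode_read` and the budget `k` are illustrative assumptions): a read is stored as a start position in the reference plus a list of (offset, substituted base) pairs.

    ```python
    # Sketch only: encode a read against a reference with a fixed
    # substitution budget k, as a (start, mutations) pair.

    def encode_read(reference: str, read: str, start: int, k: int):
        """Return (start, [(offset, base), ...]) for at most k substitutions."""
        mutations = [(i, b) for i, (a, b) in
                     enumerate(zip(reference[start:start + len(read)], read))
                     if a != b]
        if len(mutations) > k:
            raise ValueError("read exceeds the fixed mutation budget")
        return start, mutations

    def decode_read(reference: str, start: int, length: int, mutations):
        """Rebuild the original read from the reference and the mutation list."""
        chars = list(reference[start:start + length])
        for offset, base in mutations:
            chars[offset] = base
        return "".join(chars)
    ```

    Because every record has the same maximum size, a dataflow engine can lay the records out in fixed-width slots and stream them without variable-length bookkeeping.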

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives that harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use.
    In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction.
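    The abstract does not specify UdeACompress's binary layout, but the general idea of a "specialized binary-encoding strategy" for alignments can be illustrated as follows (a hypothetical sketch; the field widths and helper names `pack_mismatches`/`unpack_mismatches` are assumptions): each mismatch against the reference is packed into a fixed number of bits, with the base encoded in 2 bits since the DNA alphabet has only four symbols.

    ```python
    # Illustrative only: pack alignment mismatches into a single integer,
    # using offset_bits bits for the in-read offset and 2 bits per base.

    BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
    BITS2BASE = "ACGT"

    def pack_mismatches(mismatches, offset_bits=8):
        """Pack [(offset, base), ...] into one integer, MSB-first."""
        word = 0
        for offset, base in mismatches:
            word = (word << (offset_bits + 2)) | (offset << 2) | BASE2BITS[base]
        return word

    def unpack_mismatches(word, count, offset_bits=8):
        """Recover the list of (offset, base) pairs packed by pack_mismatches."""
        out = []
        for _ in range(count):
            out.append(((word >> 2) & ((1 << offset_bits) - 1),
                        BITS2BASE[word & 3]))
            word >>= offset_bits + 2
        out.reverse()
        return out
    ```

    With 8 offset bits, each mismatch costs 10 bits instead of the bytes needed by a textual CIGAR-style description, which is where the ratio gains of binary encoding come from.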

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings.
    With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach. (Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.)
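    The context-based encoding described above can be sketched in a few lines (a toy illustration, not the paper's data structure; `build_successor_ranks` is an assumed name): instead of storing a word's global vocabulary ID after a context, store its rank within that context's small set of observed successors.

    ```python
    # Toy sketch of context-conditioned integer remapping for n-grams:
    # a word is stored as its rank among its context's successors, so the
    # integer range is bounded by the (typically tiny) successor count,
    # not by the vocabulary size.

    from collections import defaultdict

    def build_successor_ranks(ngrams):
        """ngrams: iterable of tuples (w1, ..., wk, word).
        Returns {context: {word: rank}} with ranks in sorted word order."""
        successors = defaultdict(set)
        for *context, word in ngrams:
            successors[tuple(context)].add(word)
        return {ctx: {w: r for r, w in enumerate(sorted(ws))}
                for ctx, ws in successors.items()}

    ngrams = [("the", "cat"), ("the", "dog"), ("a", "cat")]
    ranks = build_successor_ranks(ngrams)
    # "dog" after "the" is stored as rank 1 out of 2 successors,
    # regardless of how large the full vocabulary is.
    ```

    Small integers in a bounded range then compress well with standard variable-length or bit-packed integer codes, which is the mechanism behind the space savings claimed above.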

    Compiler optimization and ordering effects on VLIW code compression

    Code size has always been an important issue for all embedded applications as well as larger systems. Code compression techniques have been devised as a way of battling bloated code; however, the impact of VLIW compiler methods and outputs on these compression schemes has not been thoroughly investigated. This paper describes the application of single- and multiple-instruction dictionary methods for code compression to decrease overall code size for the TI TMS320C6xxx DSP family. The compression scheme is applied to benchmarks taken from the Mediabench benchmark suite built with differing compiler optimization parameters. In the single-instruction encoding scheme, it was found that compression ratios were not a useful indicator of the best overall code size: the best results (smallest overall code size) were obtained when the compression scheme was applied to size-optimized code. In the multiple-instruction encoding scheme, changing parallel instruction order was found to only slightly improve compression in unoptimized code, and it does not affect code compression when applied to builds already optimized for size.
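    Single-instruction dictionary compression, as studied above, replaces frequent instruction words with short dictionary indices. A minimal sketch of the idea (not the paper's scheme; the escape tagging and helper names are assumptions) looks like this:

    ```python
    # Simplified single-instruction dictionary compression: the most
    # frequent instruction words get short indices; the rest are emitted
    # as raw escapes.

    from collections import Counter

    def build_dictionary(instructions, size):
        """Pick the `size` most frequent instruction words."""
        return [word for word, _ in Counter(instructions).most_common(size)]

    def compress(instructions, dictionary):
        """Replace dictionary hits with ("IDX", i); keep misses as ("RAW", w)."""
        index = {word: i for i, word in enumerate(dictionary)}
        return [("IDX", index[w]) if w in index else ("RAW", w)
                for w in instructions]

    code = [0x1A2B, 0x1A2B, 0x3C4D, 0x1A2B, 0x5E6F]
    d = build_dictionary(code, 1)
    out = compress(code, d)
    ```

    The paper's finding makes intuitive sense in this model: size-optimized code has fewer total instruction words to begin with, so even a modest compression ratio on it yields the smallest overall image.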

    CoMET: Compressing Microcontroller Execution Traces to Assist System Understanding

    Recent technology advances have made it possible to retrieve execution traces from microcontrollers. However, even after a short execution of the embedded program, the collected trace contains a huge amount of data, due to the cyclic nature of embedded programs. This volume of data makes understanding the program's behavior extremely difficult and time-consuming. Software engineers need a way to gain a quick understanding of execution traces. In this paper, we present an approach based on an improvement of the Sequitur algorithm to compress large microcontroller execution traces. By leveraging both the cycles and the repetitions present in such traces, our approach offers a compact and accurate compression of execution traces. This compression can be used by software engineers to understand the behavior of the system, for instance by identifying the cycles that appear most often in the trace or by comparing different cycles. Our evaluation yields two major results. On the one hand, our approach achieves a high compression rate on microcontroller execution traces. On the other hand, software engineers mostly agree that the generated outputs (compressions) can help in reviewing and understanding execution traces.
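    Sequitur builds a grammar from repeated digrams; a full implementation is beyond an abstract, but the core intuition CoMET exploits, that cyclic traces collapse dramatically when immediate repeats are folded into (sequence, count) pairs, can be shown with a much simpler stand-in (assumed for illustration, not CoMET's algorithm):

    ```python
    # Much-simplified illustration of repetition-based trace compression:
    # greedily collapse immediate repeats of a short window into
    # (sequence, repeat_count) pairs.

    def compress_cycles(trace, max_window=8):
        out, i = [], 0
        while i < len(trace):
            best = (1, 1)  # (window length, repeat count)
            for w in range(1, max_window + 1):
                block = trace[i:i + w]
                reps = 1
                # Count how many times `block` repeats back-to-back.
                while trace[i + reps * w:i + (reps + 1) * w] == block:
                    reps += 1
                if reps > 1 and w * reps > best[0] * best[1]:
                    best = (w, reps)
            w, reps = best
            out.append((trace[i:i + w], reps))
            i += w * reps
        return out
    ```

    On a trace that alternates through an event loop, e.g. `["irq", "task", "irq", "task", ...]`, the output is a handful of (cycle, count) pairs an engineer can actually read, which mirrors the use case described above.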