
    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
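    The abstract does not spell out the edit-tree coder, but the underlying idea of storing each sequence as edits against a similar, already-stored sequence can be sketched briefly. The Python fragment below is only an illustrative sketch under that assumption (it is not coil's algorithm, and the function names are hypothetical): it encodes a sequence as copy/emit operations against a reference and reconstructs it losslessly.

        # Illustrative sketch only: represent a sequence as edit operations against a
        # similar reference instead of storing it verbatim. Not coil's actual coder.
        from difflib import SequenceMatcher

        def encode_as_edits(reference: str, sequence: str):
            """Describe `sequence` as copy/emit operations relative to `reference`."""
            ops = []
            for tag, i1, i2, j1, j2 in SequenceMatcher(None, reference, sequence).get_opcodes():
                if tag == "equal":
                    ops.append(("copy", i1, i2))           # reuse reference[i1:i2]
                else:                                      # replace / insert / delete
                    ops.append(("emit", sequence[j1:j2]))  # store literal bases
            return ops

        def decode_edits(reference: str, ops) -> str:
            """Rebuild the sequence from the reference and the edit list."""
            out = []
            for op in ops:
                out.append(reference[op[1]:op[2]] if op[0] == "copy" else op[1])
            return "".join(out)

        ref = "ACGTACGTACGTTTGA"
        seq = "ACGTACCTACGTTTGAAC"
        assert decode_edits(ref, encode_as_edits(ref, seq)) == seq

    In a database with many near-identical entries, as in EST collections, such edit lists can be far smaller than the raw sequences – redundancy that a general-purpose compressor like gzip does not exploit well across record boundaries.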

    File Updates Under Random/Arbitrary Insertions And Deletions

    A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal. Comment: This paper is an extended version of our paper to appear at ITW 201
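    As a concrete, simplified picture of the dynamic-programming step, the client can compute a minimal insertion/deletion script between the old and edited files and send only that script (the entropy-coding stage is omitted here). This is a minimal sketch, not the paper's scheme, and the function names are hypothetical.

        # Minimal sketch: derive an insertion/deletion script with LCS-style dynamic
        # programming; the data-centre replays it against its old copy of the file.
        def indel_script(old: str, new: str):
            n, m = len(old), len(new)
            # dp[i][j] = length of the longest common subsequence of old[i:] and new[j:]
            dp = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n - 1, -1, -1):
                for j in range(m - 1, -1, -1):
                    dp[i][j] = dp[i + 1][j + 1] + 1 if old[i] == new[j] \
                        else max(dp[i + 1][j], dp[i][j + 1])
            ops, i, j = [], 0, 0
            while i < n and j < m:
                if old[i] == new[j]:
                    ops.append(("keep", 1)); i += 1; j += 1
                elif dp[i + 1][j] >= dp[i][j + 1]:
                    ops.append(("del", 1)); i += 1
                else:
                    ops.append(("ins", new[j])); j += 1
            ops += [("del", 1)] * (n - i) + [("ins", c) for c in new[j:]]
            return ops  # a real scheme would run-length and entropy code this list

        def apply_script(old: str, ops) -> str:
            """Decoder side: rebuild the edited file from the old copy and the script."""
            out, i = [], 0
            for op, arg in ops:
                if op == "keep":
                    out.append(old[i]); i += 1
                elif op == "del":
                    i += 1
                else:  # "ins"
                    out.append(arg)
            return "".join(out)

        assert apply_script("the quick fox",
                            indel_script("the quick fox", "the quick brown fox")) == "the quick brown fox"

    Because the insertions and deletions are only a small fraction of the file, the script is short; the information-theoretic bounds in the paper characterise how short such a description can possibly be.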

    Digital forensics formats: seeking a digital preservation storage format for web archiving

    In this paper we discuss archival storage formats from the point of view of digital curation and preservation. Taking established approaches to data management as our jumping-off point, we selected seven format attributes which are core to the long-term accessibility of digital materials. These we have labeled core preservation attributes. These attributes are then used as evaluation criteria to compare file formats belonging to five common categories: formats for archiving selected content (e.g. tar, WARC), disk image formats that capture data for recovery or installation (partimage, dd raw image), these two types combined with a selected compression algorithm (e.g. tar+gzip), formats that combine packing and compression (e.g. 7-zip), and forensic file formats for data analysis in criminal investigations (e.g. aff, the Advanced Forensic Format). We present a general discussion of the file format landscape in terms of the attributes we discuss, and make a direct comparison between the three most promising archival formats: tar, WARC, and aff. We conclude by suggesting the next steps to take the research forward and to validate the observations we have made.
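    To make the packing-plus-compression category concrete, the tar+gzip combination mentioned above can be produced and inspected with nothing but Python's standard library; the file and directory names below are hypothetical placeholders.

        # Pack a harvested-content directory into a tar archive and gzip it in one step,
        # then list the archive members. Both operations use only the standard library.
        import tarfile

        with tarfile.open("crawl-snapshot.tar.gz", "w:gz") as archive:
            archive.add("harvested_pages", arcname="harvested_pages")

        with tarfile.open("crawl-snapshot.tar.gz", "r:gz") as archive:
            for member in archive.getmembers():
                print(member.name, member.size)

    One practical consequence of this layering is that gzip compresses the whole archive stream, so extracting a single member generally requires decompressing everything stored before it.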

    IMPLEMENTATION OF DATA COMPRESSION USING THE BURROWS-WHEELER TRANSFORMATION METHOD

    ABSTRACT: Nowadays, more and more sources provide online information services that can be accessed from anywhere. Data that remains intact yet is small in size is essential, so that transfers do not burden the network, transfer times fall, costs drop, and electronic data storage becomes more efficient. A compression method that shrinks data files while fully preserving the information they convey (lossless) is therefore needed as an alternative solution to this problem. In this final project, the author examines a transformation method that can improve the effectiveness of compression techniques: the Burrows-Wheeler Transformation (BWT). The method is combined with the Huffman, LZW, and RLE compression algorithms to determine how strongly the transformation influences data compression and which combination compresses the various data types most effectively. The results show that each algorithm has its own characteristics and suits particular file types. The LZW algorithm achieves the highest compression ratios on TXT files (average ratio = 58.74%), BMP files (average ratio = 57.08%), and HTML files (average ratio = 43.80%). The Huffman algorithm gives the best results on DOC files (average ratio = 67.68%), JPG files (average ratio = 97.93%), WAV files (average ratio = 87.77%), MP3 files (average ratio = 99.22%), AVI files (average ratio = 82.31%), MPG files (average ratio = 98.15%), PDF files (average ratio = 97.49%), and EXE files (average ratio = 89.41%). The RLE algorithm compresses worse, on average, than LZW and Huffman. BWT improves the LZW compression ratio (average 97.22%) and the RLE compression ratio (average 73.45%), but it has no effect on the Huffman compression ratio. The combination of BWT with other algorithms that yields the highest compression ratio is BWT + RLE + Huffman, with an average ratio of 39.34%. Keywords: Compression, Burrows-Wheeler Transformation, Huffman Algorithm, LZW Algorithm, RLE Algorithm
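    For readers unfamiliar with the transform studied here, the sketch below shows the idea in miniature: BWT permutes the input so that similar characters cluster together, which is what makes a back-end coder such as RLE (or Huffman/LZW) more effective. This is a naive rotation-sort illustration, not the thesis implementation; production coders build the transform from a suffix array instead.

        # Naive Burrows-Wheeler Transform: sort all rotations of the input (with a unique
        # terminator) and keep the last column; the inverse rebuilds the text by
        # repeatedly prepending the last column and re-sorting.
        def bwt(text: str, end: str = "\x00") -> str:
            s = text + end
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(rot[-1] for rot in rotations)

        def inverse_bwt(last: str, end: str = "\x00") -> str:
            table = [""] * len(last)
            for _ in range(len(last)):
                table = sorted(last[i] + table[i] for i in range(len(last)))
            return next(row for row in table if row.endswith(end))[:-1]

        def run_lengths(data: str):
            """Group consecutive identical characters, as a simple RLE front end would."""
            runs = []
            for ch in data:
                if runs and runs[-1][0] == ch:
                    runs[-1][1] += 1
                else:
                    runs.append([ch, 1])
            return runs

        text = "banana" * 6
        transformed = bwt(text)
        assert inverse_bwt(transformed) == text
        print(len(run_lengths(transformed)), "runs after BWT vs", len(run_lengths(text)), "runs before")

    Fewer, longer runs after the transform are exactly what RLE and dictionary coders exploit, which is consistent with the thesis finding that BWT helps LZW and RLE but leaves Huffman's symbol-frequency model unchanged.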