CIGARCoil: A New Algorithm for the Compression of DNA Sequencing Data

Author: Womack Addison
Publication date: 05/05/2019

DNA sequencing machines produce tens of thousands to hundreds of millions of reads. Each read consists of letters from the alphabet X = {A, T, C, G, N} and varies in length from 30 to 120 characters and beyond. The reads are stored in the standard FASTQ file format, which contains not only the reads but also a quality score for each character in each read, corresponding to the probability that the identified character is correct. FASTQ files range in size from hundreds of megabytes to tens of gigabytes. The reads in the FASTQ files are processed by many DNA algorithms for various sequence analyses. Because each file is so large, keeping and handling several of them in main memory for faster processing is not possible on commodity hardware. In this thesis, we propose a lossless compression mechanism named CIGARCoil that operates on FASTQ files and other files that contain DNA reads. The other salient features of CIGARCoil are:
• It is not a reference-based algorithm, in the sense that one does not need to create a reference string before compression can begin. Reference strings are undesirable because they are hard to determine and because they are required for both compression and decompression of the file.
• In this thesis, for the first time, we show that each read can be accessed directly on the compressed structure created by CIGARCoil. That is, we provide access to each read without having to uncompress the file.
• Since we can provide direct access to a read on the CIGARCoil compressed structure, we have implemented a [] (square-bracket) array indexing operator. Through this implementation, we can implement a predictive caching mechanism that makes reads available to the end user based on their access pattern.
We have analyzed our compression mechanism on various well-known FASTQ data sets along with synthetic data sets. In all cases, our compression method produces a compressed file that is smaller than or approximately the same size as those created by existing DNA compression mechanisms, including BZIP, DSRC2, and LFQC.
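The idea of a square-bracket indexing operator over a packed structure can be illustrated with a toy sketch. This is not the CIGARCoil algorithm, whose internals the abstract does not describe. It assumes a naive fixed-width encoding of 3 bits per base, and the class name `PackedReads` and all of its attributes are hypothetical.

```python
class PackedReads:
    """Toy packed read store with per-read random access (NOT CIGARCoil)."""

    ALPHABET = "ACGTN"   # the five symbols named in the abstract
    BITS = 3             # assumption: fixed 3 bits per base (5 symbols need 3 bits)

    def __init__(self, reads):
        # _offsets[i] is the base position where read i starts;
        # the final entry marks the end of the last read.
        self._offsets = [0]
        packed = 0
        for read in reads:
            start = self._offsets[-1]
            for i, base in enumerate(read):
                code = self.ALPHABET.index(base)
                packed |= code << (self.BITS * (start + i))
            self._offsets.append(start + len(read))
        # One large integer holds all bases; fine for a toy, not for real data.
        self._packed = packed

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        if not -len(self) <= idx < len(self):
            raise IndexError("read index out of range")
        idx %= len(self)
        start, end = self._offsets[idx], self._offsets[idx + 1]
        mask = (1 << self.BITS) - 1
        # Decode only the requested read; other reads stay packed.
        return "".join(
            self.ALPHABET[(self._packed >> (self.BITS * pos)) & mask]
            for pos in range(start, end)
        )


store = PackedReads(["ACGT", "NNA", "GATTACA"])
second_read = store[1]
```

Here `store[1]` decodes only the second read. A real compressor would replace the fixed-width packing with its own encoding, but the offset table plus `__getitem__` pattern is one plausible way to expose reads by index.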

SHAREOK repository

Gerbil: A Fast and Memory-Efficient $k$-mer Counter with GPU-Support

Author: Erbert Marius, Müller-Hannemann Matthias, Rechner Steffen
Publication date: 22/07/2016

A basic task in bioinformatics is the counting of $k$-mers in genome strings. The $k$-mer counting problem is to build a histogram of all substrings of length $k$ in a given genome sequence. We present the open source $k$-mer counting software Gerbil that has been designed for the efficient counting of $k$-mers for $k \geq 32$. Given the technology trend towards long reads of next-generation sequencers, support for large $k$ becomes increasingly important. While existing $k$-mer counting tools suffer from excessive memory resource consumption or degrading performance for large $k$, Gerbil is able to efficiently support large $k$ without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored at a working disk. In a second step, the temporary files are read again, split into $k$-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large $k$, we outperform state-of-the-art open source $k$-mer counting tools for large genome data sets. Comment: A short version of this paper will appear in the proceedings of WABI 201
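The partition-then-count idea described in the abstract can be sketched in memory as follows. This is a simplified illustration, not Gerbil's implementation. It assumes in-memory lists in place of temporary files, a CRC32 hash to pick the bucket, and no handling of reverse complements. The function name `count_kmers` and the parameter `num_buckets` are hypothetical.

```python
import zlib
from collections import Counter


def count_kmers(reads, k, num_buckets=4):
    """Build a histogram of all length-k substrings in two phases."""
    # Phase 1: distribute k-mers to buckets (stand-ins for temporary files).
    # Identical k-mers always hash to the same bucket.
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)

    # Phase 2: count each bucket independently with a hash table.
    counts = Counter()
    for bucket in buckets:
        counts.update(Counter(bucket))
    return counts


histogram = count_kmers(["ACGTACGT"], k=4)
```

Because each bucket holds a disjoint set of k-mers, the buckets can be counted one at a time, which bounds the size of the hash table needed at any moment.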

arXiv.org e-Print Archive

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

biobambam: tools for read pair collation based algorithms on BAM files

Author: Leonard Steven, Tischler German
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 04/06/2013

Sequence alignment data stored in BAM files is often ordered by coordinate (ID of the reference sequence plus the position on that sequence where the fragment was mapped), as this simplifies the extraction of variants between the mapped data and the reference, or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, that require access to the full information of the pairs. In this paper we introduce biobambam, an API for efficient BAM file reading that supports the efficient collation of alignments by read name without a complete resorting of the input file, along with some tools based on this API that perform tasks such as marking duplicate reads and conversion to the FastQ format. Compared with previous approaches to problems involving the collation of alignments by read name, such as the BAM-to-FastQ or duplicate-marking utilities in the Picard suite, biobambam can often perform an equivalent task more efficiently in terms of the required main memory and run time. Comment: 17 pages, 3 figures, 2 table
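One generic way to collate mates by read name without sorting the whole input is to hold each unmatched record in a hash table until its partner arrives. The sketch below illustrates that general technique only. It is not the biobambam API, and the abstract does not say that biobambam works this way. Records are assumed to be simple `(name, payload)` tuples, and the function name `collate_pairs` is hypothetical.

```python
def collate_pairs(records):
    """Pair up records that share a read name in a single streaming pass.

    Returns (pairs, orphans): pairs is a list of (name, first, second)
    tuples in order of completion; orphans maps names to records whose
    mate never appeared.
    """
    pending = {}   # read name -> first record seen with that name
    pairs = []
    for name, payload in records:
        if name in pending:
            # Mate found: emit the pair and free the table slot.
            pairs.append((name, pending.pop(name), payload))
        else:
            pending[name] = payload
    return pairs, pending


stream = [("r1", "a"), ("r2", "b"), ("r1", "c"), ("r3", "d"), ("r2", "e")]
pairs, orphans = collate_pairs(stream)
```

Memory use is proportional to the number of reads whose mates have not yet been seen, rather than to the size of the whole file. A practical tool would also need a strategy for when that table outgrows main memory.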

arXiv.org e-Print Archive

Springer - Publisher Connector