2,176 research outputs found
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads and the level of
sampling of the underlying genome and compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel `implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.Comment: Version here is as submitted to Bioinformatics and is same as the
previously archived version. This submission registers the fact that the
advanced access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract
. Bioinformatics should be considered as the original place of publication of
this article, please cite accordingl
GPU-Accelerated BWT Construction for Large Collection of Short Reads
Advances in DNA sequencing technology have stimulated the development of
algorithms and tools for processing very large collections of short strings
(reads). Short-read alignment and assembly are among the most well-studied
problems. Many state-of-the-art aligners, at their core, have used the
Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome
(typical example, NCBI human genome). Recently, BWT has also found its use in
string-graph assembly, for indexing the reads (i.e., raw data from DNA
sequencers). In a typical data set, the volume of reads is tens of times of the
sequenced genome and can be up to 100 Gigabases. Note that a reference genome
is relatively stable and computing the index is not a frequent task. For reads,
the index has to computed from scratch for each given input. The ability of
efficient BWT construction becomes a much bigger concern than before. In this
paper, we present a practical method called CX1 for constructing the BWT of
very large string collections. CX1 is the first tool that can take advantage of
the parallelism given by a graphics processing unit (GPU, a relative cheap
device providing a thousand or more primitive cores), as well as simultaneously
the parallelism from a multi-core CPU and more interestingly, from a cluster of
GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100
Gigabases can be constructed in less than 2 hours using a machine equipped with
a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such
machines (the speedup is almost linear after excluding the first 16 minutes for
loading the reads from the hard disk). The previously fastest tool BRC is
measured to take 12 hours to process 100 Gigabases on one machine; it is
non-trivial how BRC can be parallelized to take advantage a cluster of
machines, let alone GPUs.Comment: 11 page
A representation of a compressed de Bruijn graph for pan-genome analysis that enables search
Recently, Marcus et al. (Bioinformatics 2014) proposed to use a compressed de
Bruijn graph to describe the relationship between the genomes of many
individuals/strains of the same or closely related species. They devised an
time algorithm called splitMEM that constructs this graph
directly (i.e., without using the uncompressed de Bruijn graph) based on a
suffix tree, where is the total length of the genomes and is the length
of the longest genome. In this paper, we present a construction algorithm that
outperforms their algorithm in theory and in practice. Moreover, we propose a
new space-efficient representation of the compressed de Bruijn graph that adds
the possibility to search for a pattern (e.g. an allele - a variant form of a
gene) within the pan-genome.Comment: Submitted to Algorithmica special issue of CPM201
- …