Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
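As a point of reference, the transform itself is simple to state: sort all rotations of the (terminated) string and take the last column. The minimal Python sketch below uses the naive all-rotations construction, which is quadratic and for illustration only, not the large-scale algorithm this paper describes; the '$' terminator is a standard convention, not taken from the paper.

    def bwt(s: str, terminator: str = "$") -> str:
        """Naive BWT: sort all rotations of s + terminator and return
        the last column. O(n^2 log n); for illustration only."""
        s += terminator
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("GATTACA"))  # -> ACTGA$TA

Sorting the rotations groups similar contexts together, which is what makes the last column highly compressible and, via structures such as the FM-index, searchable.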
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads, and the level of
sampling of the underlying genome, and to compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel 'implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3 Gbp of sequence to fit into only 8.2 Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
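The effect of read order on second-stage compression is easy to reproduce at toy scale. The sketch below illustrates the general principle only, not the paper's implicit sorting strategy; the simulated genome, the read-sampling scheme, and the choice of bz2 as the second-stage compressor are all assumptions made for the demonstration.

    import bz2, random

    random.seed(0)
    genome = "".join(random.choice("ACGT") for _ in range(10_000))
    # Sample 2,000 overlapping 100 bp "reads" (roughly 20x coverage).
    reads = [genome[i:i + 100]
             for i in (random.randrange(len(genome) - 100)
                       for _ in range(2_000))]

    unsorted_blob = "\n".join(reads).encode()
    sorted_blob = "\n".join(sorted(reads)).encode()  # lexicographic reorder

    print(len(bz2.compress(unsorted_blob)))  # larger
    print(len(bz2.compress(sorted_blob)))    # smaller: similar reads adjacent

Lexicographic sorting places reads sampled from overlapping positions next to each other, so the compressor sees long runs of near-identical text.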
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advance access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract.
Bioinformatics should be considered as the original place of publication of
this article; please cite accordingly.
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly
Motivation: Eugene Myers, in his string graph paper (Myers, 2005), suggested
that in a string graph, or equivalently a unitig graph, any path spells a valid
assembly. As a string/unitig graph also encodes every valid assembly of the
reads, such a graph, provided that it can be constructed correctly, is in fact
a lossless representation of the reads. In principle, every analysis based on
whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion
(INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context
of resequencing, we developed a de novo assembler, fermi, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method to 35-fold human resequencing data, we
showed that in comparison to the standard pipeline, our approach yields similar
accuracy for SNP calling and better results for INDEL calling. It has higher
sensitivity than other de novo assembly based methods for variant calling. Our
work suggests that variant calling with de novo assembly may be a beneficial
complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we propose the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding all
super-maximal exact matches, and one-pass construction of unitigs from an
FMD-index.
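As a point of reference for the super-maximal exact match (SMEM) definition (an exact match between query and reference that is not contained in any longer exact match on the query), a naive quadratic sketch follows. fermi computes SMEMs with the FMD-index and considers both strands; this forward-strand-only version is a specification of the output under those simplifying assumptions, not the paper's algorithm.

    def smems(query: str, ref: str, min_len: int = 1):
        """Naive SMEM finder: take the longest match starting at each
        query position, then keep only matches not contained in a longer
        one. O(|query|^2 * |ref|); forward strand only."""
        matches = set()
        for i in range(len(query)):
            j = i + 1
            while j <= len(query) and query[i:j] in ref:
                j += 1
            if j - 1 - i >= min_len:
                matches.add((i, j - 1))  # half-open: query[i:j-1] occurs in ref
        return sorted((a, b) for (a, b) in matches
                      if not any(c <= a and b <= d and (c, d) != (a, b)
                                 for (c, d) in matches))

    print(smems("GATTACA", "TTGATTGGACA"))  # -> [(0, 4), (4, 7)]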
Availability: http://github.com/lh3/fermi
Contact: [email protected]
Comment: Rev2: submitted version with minor improvements; 7 pages