742 research outputs found
Efficient counting of k-mers in DNA sequences using a bloom filter
<p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting <it>k</it>-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing <it>k</it>-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton <it>k</it>-mers are uninformative for many algorithms without some kind of error correction.</p> <p>Results</p> <p>We present a new method that identifies all the <it>k</it>-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed <it>k</it>-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique <it>k</it>-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting <it>k</it>-mers in sequence data with errors.</p> <p>Conclusions</p> <p>A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at <url>http://pritch.bsd.uchicago.edu/bfcounter.html</url></p
khmer: Working with Big Data in Bioinformatics
We introduce design and optimization considerations for the 'khmer' package.Comment: Invited chapter for forthcoming book on Performance of Open Source
Application
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer
MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting
A major challenge in next-generation genome sequencing (NGS) is to assemble
massive overlapping short reads that are randomly sampled from DNA fragments.
To complete assembling, one needs to finish a fundamental task in many leading
assembly algorithms: counting the number of occurrences of k-mers (length-k
substrings in sequences). The counting results are critical for many components
in assembly (e.g. variants detection and read error correction). For large
genomes, the k-mer counting task can easily consume a huge amount of memory,
making it impossible for large-scale parallel assembly on commodity servers.
In this paper, we develop MSPKmerCounter, a disk-based approach, to
efficiently perform k-mer counting for large genomes using a small amount of
memory. Our approach is based on a novel technique called Minimum Substring
Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions
such that each partition can be loaded into memory and processed individually.
By leveraging the overlaps among the k-mers derived from the same short read,
MSP can achieve astonishing compression ratio so that the I/O cost can be
significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a
very fast and memory-efficient solution. Experiment results on large real-life
short reads data sets demonstrate that MSPKmerCounter can achieve better
overall performance than state-of-the-art k-mer counting approaches.
MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
- …