MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting
A major challenge in next-generation genome sequencing (NGS) is to assemble
massive overlapping short reads that are randomly sampled from DNA fragments.
To complete assembly, one must perform a task fundamental to many leading
assembly algorithms: counting the number of occurrences of k-mers (length-k
substrings of sequences). The counts are critical for many components
of assembly (e.g. variant detection and read error correction). For large
genomes, k-mer counting can easily consume a huge amount of memory,
making large-scale parallel assembly on commodity servers impossible.
In this paper, we develop MSPKmerCounter, a disk-based approach, to
efficiently perform k-mer counting for large genomes using a small amount of
memory. Our approach is based on a novel technique called Minimum Substring
Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions
such that each partition can be loaded into memory and processed individually.
By leveraging the overlaps among the k-mers derived from the same short read,
MSP can achieve a high compression ratio, so that the I/O cost is
significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a
very fast and memory-efficient solution. Experimental results on large real-life
short-read data sets demonstrate that MSPKmerCounter can achieve better
overall performance than state-of-the-art k-mer counting approaches.
MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
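The partitioning idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the MSPKmerCounter implementation: it assumes each k-mer is keyed by its lexicographically smallest length-p substring, that runs of adjacent k-mers sharing that key are stored once as a merged string, and that partitions are chosen by a CRC32 hash of the key. The function names, the parameter p, and the choice of hash are assumptions made for illustration; reverse complements are ignored and partitions are kept in memory rather than written to disk.

```python
import zlib
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer (needs p <= k)."""
    return min(kmers(kmer, p))


def bucket(sub, num_partitions):
    """Map a minimum substring to a partition id (hash choice is an assumption)."""
    return zlib.crc32(sub.encode("ascii")) % num_partitions


def partition_reads(reads, k, p, num_partitions):
    """Split reads into partitions of merged runs of adjacent k-mers.

    Adjacent k-mers that share a minimum p-substring overlap in k-1 characters,
    so a run of n such k-mers is stored as one string of length n + k - 1.
    """
    partitions = defaultdict(list)
    for read in reads:
        if len(read) < k:
            continue  # too short to contain any k-mer
        start = 0
        current = min_substring(read[:k], p)
        for i in range(1, len(read) - k + 1):
            m = min_substring(read[i:i + k], p)
            if m != current:
                # Close the run covering k-mers start .. i-1.
                partitions[bucket(current, num_partitions)].append(read[start:i + k - 1])
                start, current = i, m
        partitions[bucket(current, num_partitions)].append(read[start:])
    return partitions


def count_with_partitions(reads, k, p, num_partitions):
    """Count k-mers one partition at a time, then merge the per-partition counts."""
    total = Counter()
    for merged_runs in partition_reads(reads, k, p, num_partitions).values():
        local = Counter()
        for run in merged_runs:
            local.update(kmers(run, k))
        total.update(local)
    return total


if __name__ == "__main__":
    example_reads = ["ACGTACGTGACG", "TTACGTACGA"]
    for kmer, n in sorted(count_with_partitions(example_reads, 5, 3, 4).items()):
        print(kmer, n)
```

Because a given k-mer always has the same minimum substring, every occurrence of it lands in the same partition, so each partition can be counted independently and the results merged without conflicts.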
Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large-scale metagenomic projects aim to extract biodiversity
knowledge across different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomic or functional assignment rely on a small subset of the sequences
that can be associated with known organisms. On the other hand, de novo methods,
which compare the whole sets of sequences, either do not scale to ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts with
k-mer counts. Simka scales to today's metagenomic projects thanks to a new
parallel k-mer counting strategy on multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute in a few hours both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that the results of analyzing metagenomes at the
k-mer level are highly correlated with those of extremely precise de novo
comparison techniques that rely on all-versus-all sequence alignment or on
taxonomic profiling.
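To make the idea of replacing species counts with k-mer counts concrete, the sketch below computes one quantitative distance (Bray-Curtis) and one qualitative, presence/absence distance (Jaccard) between two samples. Both are standard ecological distances chosen here for illustration; the abstract does not say which distances Simka computes, and the function names and toy sequences are assumptions rather than Simka's interface.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring across all sequences of one sample."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Quantitative distance: 1 - 2 * (shared abundance) / (total abundance)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard(a, b):
    """Qualitative distance: 1 - |shared k-mers| / |all distinct k-mers|."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


if __name__ == "__main__":
    sample_a = count_kmers(["ACGTACGT"], 3)
    sample_b = count_kmers(["ACGTTT"], 3)
    print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
    print("Jaccard:", jaccard(sample_a, sample_b))
```

Here each distinct k-mer plays the role a species would play in a classical ecological comparison, so no reference database or taxonomic assignment is needed.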