A basic task in bioinformatics is the counting of $k$-mers in genome strings.
The $k$-mer counting problem is to build a histogram of all substrings of
length $k$ in a given genome sequence. We present the open source $k$-mer
counting software Gerbil that has been designed for the efficient counting of
$k$-mers for $k\geq32$. Given the technology trend towards long reads of
next-generation sequencers, support for large $k$ becomes increasingly
important. While existing $k$-mer counting tools suffer from excessive memory
resource consumption or degrading performance for large $k$, Gerbil is able to
efficiently support large $k$ without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files that are stored on a working disk. In a
second step, the temporary files are read again, split into $k$-mers and
counted via a hash table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large $k$, we outperform state-of-the-art
open source $k$-mer counting tools for large genome data sets.

Comment: A short version of this paper will appear in the proceedings of WABI
  201
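
The $k$-mer counting problem stated in the abstract (a histogram of all substrings of length $k$) can be illustrated with a minimal in-memory sketch. This is not Gerbil's implementation; the function name `count_kmers` and the toy sequence are assumptions made for illustration only, and the naive approach shown here keeps every distinct $k$-mer in memory, which is exactly what becomes impractical for large genome data sets.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Return a histogram mapping each length-k substring to its occurrence count.

    Hypothetical helper for illustration; not part of Gerbil.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers
    # (none if the sequence is shorter than k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy example: the 4-mer "ACGT" occurs twice in this sequence.
histogram = count_kmers("ACGTACGT", 4)
```

Every occurrence is counted, so the values of the histogram always sum to the number of $k$-mer positions in the input.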

Erbert, Marius

Müller-Hannemann, Matthias

Rechner, Steffen

English

arXiv

Abstract

Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results: We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions: While Gerbil's performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
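
The two-step approach described above (redistribute reads to temporary files, then count each file with a hash table) can be sketched as follows. The abstract does not say how reads are partitioned, so this sketch assumes a minimizer-based scheme: consecutive k-mers that share the same lexicographically smallest m-mer are stored together as one substring, and the bucket is chosen by hashing that m-mer. In-memory lists stand in for the temporary files, and all names (`minimizer`, `distribute`, `count_buckets`) and parameters (`m`, `num_buckets`) are hypothetical.

```python
import zlib
from collections import Counter


def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest length-m substring of a k-mer (assumed scheme)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def distribute(reads, k, m, num_buckets):
    """Step 1: split reads into runs of k-mers sharing a minimizer and bucket them."""
    if not 0 < m <= k:
        raise ValueError("require 0 < m <= k")
    buckets = [[] for _ in range(num_buckets)]  # stand-ins for temporary files

    def emit(segment, mz):
        buckets[zlib.crc32(mz.encode()) % num_buckets].append(segment)

    for read in reads:
        start, current = 0, None
        for i in range(len(read) - k + 1):
            mz = minimizer(read[i:i + k], m)
            if current is None:
                current = mz
            elif mz != current:
                # Close the run covering the k-mers at positions start..i-1.
                emit(read[start:i - 1 + k], current)
                start, current = i, mz
        if current is not None:
            emit(read[start:], current)  # last run reaches the end of the read
    return buckets


def count_buckets(buckets, k):
    """Step 2: count the k-mers of each bucket with its own hash table, then merge."""
    total = Counter()
    for bucket in buckets:
        table = Counter()  # only one bucket's table is needed at a time
        for segment in bucket:
            table.update(segment[i:i + k] for i in range(len(segment) - k + 1))
        total.update(table)
    return total


reads = ["ACGTACGTTGCA", "TTGCAACGTAC"]
two_step = count_buckets(distribute(reads, k=5, m=3, num_buckets=4), k=5)
```

Because identical k-mers always have the same minimizer, every occurrence of a given k-mer lands in the same bucket, so each bucket can be counted independently with a hash table that is much smaller than one covering the whole input.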

Marius Erbert

Steffen Rechner

Matthias Müller-Hannemann

Directory of Open Access Journals

Algorithms for Molecular Biology

Gerbil: a fast and memory-efficient k-mer counter with GPU-support

Springer - Publisher Connector

