
Gerbil: A Fast and Memory-Efficient $k$-mer Counter with GPU-Support

Abstract

A basic task in bioinformatics is the counting of $k$-mers in genome strings. The $k$-mer counting problem is to build a histogram of all substrings of length $k$ in a given genome sequence. We present the open source $k$-mer counting software Gerbil, which has been designed for the efficient counting of $k$-mers for $k \geq 32$. Given the technology trend towards long reads of next-generation sequencers, support for large $k$ becomes increasingly important. While existing $k$-mer counting tools suffer from excessive memory consumption or degrading performance for large $k$, Gerbil supports large $k$ efficiently without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored on a working disk. In the second step, the temporary files are read again, split into $k$-mers, and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large $k$, we outperform state-of-the-art open source $k$-mer counting tools on large genome data sets.

Comment: A short version of this paper will appear in the proceedings of WABI 201
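
The two-step idea described in the abstract (distribute first, then count each partition with a hash table) can be illustrated with a minimal Python sketch. This is not Gerbil's implementation: the function names, the `num_bins` parameter, and the CRC32-based binning are assumptions made for illustration, in-memory lists stand in for the temporary files on the working disk, and reads are split into $k$-mers before binning so that identical $k$-mers always land in the same bin.

```python
from collections import Counter
import zlib


def distribute(reads, k, num_bins):
    """Step 1 (simplified): route every k-mer of every read to a bin.

    The lists are hypothetical stand-ins for temporary files on a working disk.
    """
    bins = [[] for _ in range(num_bins)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # A deterministic hash sends equal k-mers to the same bin.
            bins[zlib.crc32(kmer.encode()) % num_bins].append(kmer)
    return bins


def count_bins(bins):
    """Step 2: count each bin independently with a hash table (Counter)."""
    total = Counter()
    for bin_kmers in bins:
        total.update(Counter(bin_kmers))
    return total


def count_kmers(reads, k, num_bins=4):
    """Build the k-mer histogram for a collection of reads."""
    return count_bins(distribute(reads, k, num_bins))


counts = count_kmers(["ACGTACGT", "CGTA"], k=3)
```

Because each $k$-mer value is confined to a single bin, every bin can be counted on its own with a hash table sized for that bin alone, which is what keeps the memory footprint of the counting step bounded in a scheme of this kind.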
