68 research outputs found
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support
A basic task in bioinformatics is the counting of k-mers in genome strings.
The k-mer counting problem is to build a histogram of all substrings of
length k in a given genome sequence. We present the open source k-mer
counting software Gerbil that has been designed for the efficient counting of
k-mers for k ≥ 32. Given the technology trend towards long reads of
next-generation sequencers, support for large k becomes increasingly
important. While existing k-mer counting tools suffer from excessive memory
resource consumption or degrading performance for large k, Gerbil is able to
efficiently support large k without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files that are stored on a working disk. In a
second step, the temporary files are read again, split into k-mers and
counted via a hash table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large k, we outperform state-of-the-art
open source k-mer counting tools for large genome data sets.
Comment: A short version of this paper will appear in the proceedings of WABI
201
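The two-step scheme described in the Gerbil abstract can be illustrated with a minimal Python sketch. This is not Gerbil's code: the minimizer-based bucketing rule, the parameter m, the CRC32 bucket choice, and all function names are assumptions made for illustration, and canonical (reverse-complement) k-mers are ignored.

```python
import tempfile
import zlib
from collections import Counter
from pathlib import Path


def minimizer(kmer, m):
    """Lexicographically smallest m-mer inside a k-mer (assumed bucketing key)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bucket_of(key, n_files):
    """Deterministic bucket index for a minimizer."""
    return zlib.crc32(key.encode()) % n_files


def distribute(reads, k, m, n_files, workdir):
    """Step 1: cut reads into runs of k-mers sharing a minimizer, write each run to a temp file."""
    paths = [Path(workdir) / f"bucket_{i}.txt" for i in range(n_files)]
    handles = [p.open("w") for p in paths]
    try:
        for read in reads:
            if len(read) < k:
                continue  # too short to contain any k-mer
            start, current = 0, minimizer(read[:k], m)
            for i in range(1, len(read) - k + 1):
                mz = minimizer(read[i:i + k], m)
                if mz != current:
                    # k-mers start..i-1 share `current`; they span read[start:i-1+k]
                    handles[bucket_of(current, n_files)].write(read[start:i - 1 + k] + "\n")
                    start, current = i, mz
            handles[bucket_of(current, n_files)].write(read[start:] + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


def count_bucket(path, k):
    """Step 2: read one temp file back, split into k-mers, count in a hash table."""
    counts = Counter()
    with path.open() as fh:
        for line in fh:
            s = line.strip()
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def count_two_phase(reads, k, m=2, n_files=4):
    """Run both steps in a temporary working directory and merge the per-file tables."""
    total = Counter()
    with tempfile.TemporaryDirectory() as workdir:
        for path in distribute(reads, k, m, n_files, workdir):
            total.update(count_bucket(path, k))
    return total
```

Because every occurrence of a given k-mer has the same minimizer, it always lands in the same temporary file, so each file can be counted independently with a table that holds only a fraction of the distinct k-mers.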
Parallel approach to sliding window sums
Sliding window sums are widely used in bioinformatics applications, including
sequence assembly, k-mer generation, hashing and compression. New vector
algorithms which utilize the advanced vector extension (AVX) instructions
available on modern processors, or the parallel compute units on GPUs and
FPGAs, would provide a significant performance boost for the bioinformatics
applications. We develop a generic vectorized sliding sum algorithm whose
speedup, for window size w and number of processors P, is O(P/w) for a generic
sliding sum. For a sum with a commutative operator, the speedup improves to
O(P/log(w)). When applied to the genomic application of minimizer-based k-mer
table generation using AVX instructions, we obtain a speedup of over 5X.
Comment: 10 pages, 5 figures
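To show how a logarithmic dependence on the window size w can arise, here is a minimal Python sketch of a doubling scheme for sliding window reductions. It is a generic technique and is not claimed to be the algorithm from the paper; the function name `sliding_reduce` is hypothetical, and plain lists stand in for vector registers. Each list comprehension is one element-wise pass of the kind that maps onto vector instructions, and only O(log w) such passes are needed.

```python
from operator import add


def sliding_reduce(xs, w, op=add):
    """Reduce every length-w window of xs with an associative `op` in O(log w) passes."""
    n = len(xs)
    if w < 1 or w > n:
        raise ValueError("window size must satisfy 1 <= w <= len(xs)")
    power = list(xs)   # power[i] reduces xs[i:i+size]
    size = 1
    result = None      # result[i] reduces xs[i:i+covered]
    covered = 0
    remaining = w
    while remaining:
        if remaining & 1:
            if result is None:
                result = power[:]
            else:
                # Append a block of `size` elements to the right of each partial window.
                result = [op(result[i], power[i + covered])
                          for i in range(n - covered - size + 1)]
            covered += size
        remaining >>= 1
        if remaining:
            # Double the block size: combine two adjacent blocks of length `size`.
            power = [op(power[i], power[i + size]) for i in range(n - 2 * size + 1)]
            size *= 2
    return result
```

For example, `sliding_reduce([1, 2, 3, 4, 5], 3)` combines blocks of size 1 and 2 to form each window of size 3. Operand order is preserved, so the sketch also works for associative operators that are not commutative, such as string concatenation.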
PlasmidTron: assembling the cause of phenotypes and genotypes from NGS data.
Increasingly rich metadata are now being linked to samples that have been whole-genome sequenced. However, much of this information is ignored. This is because linking this metadata to genes, or regions of the genome, usually relies on knowing the gene sequence(s) responsible for the particular trait being measured and looking for its presence or absence in that genome. Examples of this would be the spread of antimicrobial resistance genes carried on mobile genetic elements (MGEs). However, although it is possible to routinely identify the resistance gene, identifying the unknown MGE upon which it is carried can be much more difficult if the starting point is short-read whole-genome sequence data. The reason for this is that MGEs are often full of repeats and so assemble poorly, leading to fragmented consensus sequences. Since mobile DNA, which can carry many clinically and ecologically important genes, has a different evolutionary history from the host, its distribution across the host population will, by definition, be independent of the host phylogeny. It is possible to use this phenomenon in a genome-wide association study to identify both the genes associated with the specific trait and also the DNA linked to that gene, for example the flanking sequence of the plasmid vector on which it is encoded, which follows the same patterns of distribution as the marker gene/sequence itself. We present PlasmidTron, which utilizes the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographical information, to identify traits that are likely to be present on DNA that can randomly reassort across defined bacterial populations. It is also possible to use this methodology to associate unknown genes/sequences (e.g. plasmid backbones) with a specific molecular signature or marker (e.g. resistance gene presence or absence) using PlasmidTron. 
PlasmidTron uses a k-mer-based approach to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs in a fast and scalable manner. PlasmidTron is written in Python 3 and is available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron
MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation
Metagenomics, the study of genome sequences of diverse organisms cohabiting
in a shared environment, has experienced significant advancements across
various medical and biological fields. Metagenomic analysis is crucial, for
instance, in clinical applications such as infectious disease screening and the
diagnosis and early detection of diseases such as cancer. A key task in
metagenomics is to determine the species present in a sample and their relative
abundances. Currently, the field is dominated by either alignment-based tools,
which offer high accuracy but are computationally expensive, or alignment-free
tools, which are fast but lack the needed accuracy for many applications. In
response to this dichotomy, we introduce MetaTrinity, a tool based on
heuristics, to achieve a fundamental improvement in accuracy-runtime tradeoff
over existing methods. We benchmark MetaTrinity against two leading metagenomic
classifiers, each representing different ends of the performance-accuracy
spectrum. On one end, Kraken2, a tool optimized for performance, shows modest
accuracy yet a rapid runtime. The other end of the spectrum is governed by
Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity
achieves an accuracy comparable to Metalign while gaining a 4x speedup without
any loss in accuracy. This directly equates to a fourfold improvement in
runtime-accuracy tradeoff. Compared to Kraken2, MetaTrinity requires a 5x
longer runtime yet delivers a 17x improvement in accuracy. This demonstrates a
3.4x enhancement in the accuracy-runtime tradeoff for MetaTrinity. This dual
comparison positions MetaTrinity as a broadly applicable solution for
metagenomic classification, combining advantages of both ends of the spectrum:
speed and accuracy. MetaTrinity is publicly available at
https://github.com/CMU-SAFARI/MetaTrinity
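The tradeoff figures quoted in the MetaTrinity abstract follow from simple arithmetic, sketched below. The sketch assumes the tradeoff gain is defined as the accuracy ratio divided by the runtime ratio, and it treats "comparable accuracy" as a ratio of exactly 1.0; both are interpretations, and the function name `tradeoff_gain` is hypothetical.

```python
def tradeoff_gain(accuracy_ratio, runtime_ratio):
    """Accuracy gained per unit of extra runtime (assumed definition)."""
    return accuracy_ratio / runtime_ratio


# Versus Kraken2: 17x the accuracy at 5x the runtime.
vs_kraken2 = tradeoff_gain(17.0, 5.0)

# Versus Metalign: comparable accuracy (taken as 1.0) at one quarter of the runtime.
vs_metalign = tradeoff_gain(1.0, 1.0 / 4.0)
```

Under these assumptions the two values match the 3.4x and fourfold improvements stated in the abstract.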
Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables
Motivation. In biological sequence analysis, alignment-free (also known as k-mer-based) methods are increasingly replacing mapping- and alignment-based methods for various applications. A basic step of such methods consists of building a table of all k-mers of a given set of sequences (a reference genome or a dataset of sequenced reads) and their counts. Over the past years, efficient methods and tools for k-mer counting have been developed. In a different line of work, the use of gapped k-mers has been shown to offer advantages over the use of the standard contiguous k-mers. However, no tool seems to be available that is able to count gapped k-mers with the same efficiency as contiguous k-mers. One reason is that the most efficient k-mer counters use minimizers (of a length m < k) to group k-mers into buckets, such that many consecutive k-mers are classified into the same bucket. This approach leads to cache-friendly (and hence extremely fast) algorithms, but the approach does not transfer easily to gapped k-mers. Consequently, the existing efficient k-mer counters cannot be trivially modified to count gapped k-mers with the same efficiency.
Results. We present a different approach that is equally applicable to contiguous k-mers and gapped k-mers. We use multi-way bucketed Cuckoo hash tables to efficiently store (gapped) k-mers and their counts. We also describe a method to parallelize counting over multiple threads without using locks: we subdivide the hash table into independent subtables and use a producer-consumer model, such that each thread serves one subtable. This requires designing Cuckoo hash functions with the property that all alternative locations for each k-mer are located in the same subtable. Compared to some of the fastest contiguous k-mer counters, our approach is of comparable speed, or even faster, on large datasets, and it is the only one that supports gapped k-mers.
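Two ideas from this abstract can be sketched in a few lines of Python: extracting gapped k-mers with a mask, and choosing hash locations so that every alternative bucket for a key lies in one subtable. This is an illustration under assumptions, not the authors' implementation: the mask notation (`#` for significant positions, `_` for gaps), the CRC32-based hash functions, and the names `gapped_kmers` and `locations` are all hypothetical.

```python
import zlib
from collections import Counter


def gapped_kmers(seq, mask):
    """Yield the gapped k-mer at each position: only characters under '#' are kept."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + i] for i in keep)


def locations(key, n_subtables, buckets_per_subtable, n_choices=3):
    """Return (subtable, candidate buckets); all candidates lie inside that one subtable."""
    data = key.encode()
    subtable = zlib.crc32(data) % n_subtables
    # Salt the key with the choice index to get several bucket candidates (assumed scheme).
    buckets = [zlib.crc32(bytes([j + 1]) + data) % buckets_per_subtable
               for j in range(n_choices)]
    return subtable, buckets


# Counting with a plain dictionary stands in for the Cuckoo table here.
counts = Counter(gapped_kmers("ACGACG", "#_#"))
```

Because the subtable index depends only on the key, a consumer thread that owns one subtable never needs to touch another thread's memory when it relocates entries during Cuckoo insertion, which is what makes lock-free operation possible.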
GPU-accelerated k-mer counting
K-mer counting is the process of building a histogram of all substrings of length k for an input string S. The problem itself is quite simple, but counting k-mers efficiently for a very large input string is a difficult task that has been researched extensively. In recent years the performance of k-mer counting algorithms has improved significantly, and there have been efforts to use graphics processing units (GPUs) in k-mer counting. The goal of this thesis was to design, implement and benchmark a GPU-accelerated k-mer counting algorithm, SNCGPU. The results showed that SNCGPU compares reasonably well to the Gerbil k-mer counting algorithm on a mid-range desktop computer, but does not utilize the resources of a high-end computing platform as efficiently. The implementation of SNCGPU is available as open-source software.
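The definition in the first sentence of this abstract translates directly into a short Python baseline. It is a CPU-only reference sketch for small inputs, unrelated to the SNCGPU implementation, and the function name `count_kmers` is hypothetical.

```python
from collections import Counter


def count_kmers(s, k):
    """Histogram of all length-k substrings of s."""
    if k < 1:
        raise ValueError("k must be positive")
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


histogram = count_kmers("ACGTACGT", 4)
```

The input "ACGTACGT" contains five 4-mers, of which "ACGT" occurs twice. The difficulty the abstract refers to comes from scale: for genome-sized inputs the table of distinct k-mers no longer fits comfortably in memory, which is what motivates disk-based and GPU-based designs.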
- …