PaSiT: A novel approach based on short oligo-nucleotide frequencies for efficient bacterial identification and typing.
- Publication date
- 2020
Abstract
Motivation: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa
is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore
not suitable for large-scale comparisons. Short oligonucleotide-based methods do offer a faster alternative but
at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a
novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.
Results: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically
well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based
method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover,
the lightweight nature of this method makes it applicable for large-scale analyses.
Availability and implementation: The method introduced here was implemented, together with other existing methods,
in dependency-free software written in C, GenDisCal, available as source code from
https://github.com/LMUGent/GenDisCal. The software supports multithreading and has been tested on Windows and
Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.
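
To illustrate the general idea of comparing genomes by short-oligonucleotide frequencies, the following is a minimal Python sketch. It is not the PaSiT method and not GenDisCal code: the function names `kmer_frequencies` and `frequency_distance`, and the choice of Euclidean distance, are assumptions made for illustration only. It counts overlapping k-mers (k = 4 for tetranucleotides), normalizes them to relative frequencies, and compares two sequences through their frequency vectors.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(sequence, k=4):
    """Relative frequencies of all 4**k k-mers over the alphabet A, C, G, T.

    Hypothetical helper for illustration; k-mers containing other
    characters (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed ordering of k-mers so that vectors are comparable between genomes.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def frequency_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency vectors.

    Assumption: Euclidean distance is used only as a simple stand-in;
    it is not the distance defined by PaSiT.
    """
    freq_a = kmer_frequencies(seq_a, k)
    freq_b = kmer_frequencies(seq_b, k)
    return sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))
```

No alignment step is involved, which is why frequency-based comparisons of this kind scale to large numbers of genomes. A practical tool would also need to handle reverse complements and multi-contig assemblies, which this sketch omits.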