PlasmidTron: assembling the cause of phenotypes from NGS data

Abstract

When bacterial populations are defined through whole genome sequencing (WGS), the samples often have detailed associated metadata relating to disease severity, antimicrobial resistance (AMR), or even rare biochemical traits. When these populations are compared, it is apparent that some of these phenotypes do not follow the phylogeny of the host, i.e. they are genetically unlinked to the evolutionary history of the host bacterium. One possible explanation is that the underlying genes move independently between hosts and are likely to be associated with mobile genetic elements (MGEs). However, identifying the element associated with these traits can be complex if the starting point is short-read WGS data. With the increased use of next generation WGS in routine diagnostics, surveillance and epidemiology, a vast amount of short-read data is available, and these types of associations remain relatively unexplored. One way to address this would be to perform de novo assembly of the whole genome read data, including its MGEs. However, MGEs are often full of repeats, which can lead to fragmented consensus sequences, and deciding which sequence is part of the chromosome and which is part of an MGE can be ambiguous. We present PlasmidTron, which utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype. Given a set of reads, categorised into cases (showing the phenotype) and controls (phylogenetically related but phenotypically negative), PlasmidTron assembles de novo the reads from each sample that are linked to the phenotype. A k-mer based analysis is performed to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs.
By utilising k-mers and only assembling a fraction of the raw reads, the method is fast and scalable to large datasets. This approach has been tested on plasmids (hence the name) because of their contribution to important pathogen-associated traits such as AMR, but there is no reason why it cannot be utilised for any MGE that can move independently through a bacterial population. PlasmidTron is written in Python 3 and is available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron.

DATA SUMMARY

Source code for PlasmidTron is available from GitHub under the open source licence GNU GPL 3 (URL: https://goo.gl/ot6rT5).
Simulated raw read files have been deposited in Figshare (URL: https://doi.org/10.6084/m9.figshare.5406355.vl).
Salmonella enterica serovar Weltevreden strain VNS10259 is available from GenBank; accession number GCA_001409135.
Salmonella enterica serovar Typhi strain BL60006 is available from GenBank; accession number GCA_900185485.
Accession numbers for all of the Illumina datasets used in this paper are listed in the supplementary tables.

I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ⊠

IMPACT STATEMENT

PlasmidTron utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype.
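
The case/control k-mer filtering described in the abstract can be illustrated with a toy sketch. This is not PlasmidTron's implementation: the function names (`kmers`, `kmer_set`, `select_case_reads`) and the `min_fraction` threshold are hypothetical and chosen for illustration. Reads are plain strings, and reverse complements and sequencing errors are ignored. The idea shown is only that reads from a case sample are kept when most of their k-mers are absent from the controls, so that a much smaller read set is passed on to de novo assembly.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Return the union of the k-mers of every read."""
    result: Set[str] = set()
    for read in reads:
        result |= kmers(read, k)
    return result


def select_case_reads(
    case_reads: Iterable[str],
    control_reads: Iterable[str],
    k: int = 5,
    min_fraction: float = 0.5,  # hypothetical threshold, for illustration only
) -> List[str]:
    """Keep case reads whose k-mers are mostly absent from the controls."""
    control_kmers = kmer_set(control_reads, k)
    selected = []
    for read in case_reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            # Read is shorter than k, so it carries no k-mers to compare.
            continue
        case_only = read_kmers - control_kmers
        if len(case_only) / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected


if __name__ == "__main__":
    case_reads = ["ACGTACGTAC", "TTGGCCAATT"]     # second read mimics MGE sequence
    control_reads = ["ACGTACGTAC", "ACGTACGTTT"]  # chromosome-like background
    print(select_case_reads(case_reads, control_reads, k=5))
```

In this example only the second case read survives the filter, because every k-mer of the first read also occurs in the controls. In practice the selected reads would then be passed to a de novo assembler to produce contigs.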
