Bacterial host attribution and bioinformatic characterisation of enteric bacteria Salmonella enterica and Escherichia coli from different hosts and environments
With the advent of relatively low-cost whole genome sequencing (WGS), it is
now possible to obtain sequences from large numbers of bacterial strains and
interrogate their core and accessory genomes in relation to associated metadata.
While some bacterial species have preferred hosts, especially
in terms of disease, there has been no systematic genomic investigation
of host and niche specificity of ‘generalist’ bacteria, i.e., those that can be isolated
from multiple hosts and environments.
The main aim of this research was to determine whether host- and/or niche-specific
proteins can be identified for ‘multi-host adapted’ bacteria such as E. coli and
Salmonella Typhimurium (STm) in order to predict the ‘origin’ of a strain and its
zoonotic potential from its sequence.
Two datasets of ‘multi-host’ bacteria were analysed: 1,203 STm isolates from 4
hosts (avian, bovine, human and swine) and E. coli isolates from 6 hosts (avian, bovine,
canine, environmental, human and swine). Classical core genome
analyses, such as core phylogeny, multilocus sequence typing and phylo-grouping,
identified no strong correlations with host.
The accessory genome was also investigated for host-based associations, and
accessory host-associated proteins (HAP) were identified for each
bacterium/host group. These proteins were used to train a machine learning (ML)
classifier, a support vector machine (SVM), to predict the isolation host of the
bacterial isolates. The majority of the isolates from both species were predicted
correctly, with prediction accuracy ranging from 67% to 90%. For both
bacterial species, the bovine and swine host groups were the most challenging to
distinguish, as these two had many features in common. The approach allowed not only
prediction of host based on WGS but also an assessment of how much the
genome of particular isolates resembled the features of the genomes of the
same species isolated from other hosts. This allowed ‘generalist’ and ‘specialist’
strains from each host group to be estimated, as well as the sequences
that indicate potential for successful transmission between hosts. This work also
showed that diverse collections of E. coli or STm can be used as a baseline for
prediction and quantification of zoonotic potential, as was demonstrated with
E. coli O157 and Salmonella serovar Typhi. Overall, this part of the research
indicated marked host restriction for both STm and E. coli, with only limited
isolate subsets exhibiting host promiscuity based on predicted protein content.
ML can be successfully applied to interrogate source attribution of bacterial
isolates and has the capacity to predict zoonotic potential.
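As a rough illustration of the classification step described above, the following sketch trains a one-vs-rest linear SVM on binary HAP presence/absence profiles and reports a decision score per host. It is not the thesis pipeline: the protein names, isolates, hyperparameters and the simple subgradient-descent trainer are all assumptions chosen to keep the example self-contained, and a real analysis would more likely rely on an established SVM library. The per-host score is one possible way to express how strongly an isolate resembles each host group.

```python
# Hypothetical HAP presence/absence profiles (1 = protein present).
# Protein names, isolates and hosts are invented for illustration only.
PROTEINS = ["hap_A", "hap_B", "hap_C", "hap_D", "hap_E", "hap_F"]
TRAINING = [
    ("avian",  [1, 1, 0, 0, 0, 0]),
    ("avian",  [1, 0, 0, 0, 0, 0]),
    ("avian",  [0, 1, 0, 0, 0, 0]),
    ("bovine", [0, 0, 1, 1, 0, 0]),
    ("bovine", [0, 0, 1, 0, 0, 0]),
    ("bovine", [0, 0, 0, 1, 0, 0]),
    ("human",  [0, 0, 0, 0, 1, 1]),
    ("human",  [0, 0, 0, 0, 1, 0]),
    ("human",  [0, 0, 0, 0, 0, 1]),
]


def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=2000):
    """Fit w, b by full-batch subgradient descent on the L2-regularised hinge loss.

    y must contain +1 / -1 labels. No random sampling is used, so the
    result is deterministic.
    """
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:            # inside the margin: hinge term is active
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(training):
    """Train one binary SVM per host (that host versus all other hosts)."""
    X = [profile for _, profile in training]
    hosts = sorted({host for host, _ in training})
    models = {}
    for host in hosts:
        y = [1 if h == host else -1 for h, _ in training]
        models[host] = train_linear_svm(X, y)
    return models


def host_scores(models, profile):
    """Signed decision score per host; higher means more similar to that host."""
    return {
        host: sum(wj * xj for wj, xj in zip(w, profile)) + b
        for host, (w, b) in models.items()
    }


def predict_host(models, profile):
    """Predicted isolation host: the host with the highest decision score."""
    scores = host_scores(models, profile)
    return max(scores, key=scores.get)


models = train_one_vs_rest(TRAINING)
training_accuracy = sum(
    predict_host(models, profile) == host for host, profile in TRAINING
) / len(TRAINING)

# An isolate carrying one avian and one bovine marker: both its avian and
# bovine scores exceed its human score, i.e. a more 'generalist' profile.
mixed_scores = host_scores(models, [1, 0, 1, 0, 0, 0])
```

In this toy setting each host has its own marker proteins, so the training isolates are trivially separable; the point is only to show how per-host scores can flag an isolate whose protein content resembles more than one host group.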
The same ML approach was then used to ask how similar isolates of known
zoonotic pathogens are to one another. When analysed separately, E. coli O157 can be
classified further into human and bovine isolates, with only a small proportion
of bovine isolates predicted as ‘human’, pointing to the specific cattle strains
that are potentially a more serious threat to human health. This approach was
tested with 2 independent sets of O157 human outbreak strains with traced-back
isolates from animals and food. The outbreak strains, irrespective of their
origin, were scored as ‘human’. This finding has profound implications for public
health management of disease because interventions in cattle, such as vaccination,
could be targeted at herds carrying strains of high zoonotic potential.
The final section of the thesis research was based on the STm dataset and compared
different ML approaches to test which algorithm performed best for host
prediction. Dimensionality reduction techniques as well as unsupervised and
supervised ML were applied to the HAP. Dimensionality reduction techniques and
unsupervised ML were not able to split the dataset by host and produced differing
results that could be challenging to interpret correctly in terms of the
biological significance of the factors that influenced clustering. In contrast,
all three supervised classifiers achieved comparably high prediction
accuracy (over 95%). Thus, the choice of supervised classifier for host prediction
should be based on the knowledge of the end-user as well as on requirements
for any further analysis.
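A comparison of supervised classifiers of this kind could be organised along the following lines. The abstract does not name the three classifiers or the validation scheme, so this sketch substitutes two deliberately simple stand-ins (a 1-nearest-neighbour rule on Hamming distance and a nearest-centroid rule) and leave-one-out cross-validation. The data and all function names are hypothetical.

```python
# Hypothetical HAP presence/absence profiles; hosts and values are invented.
DATA = [
    ("avian",  [1, 1, 0, 0, 0, 0]),
    ("avian",  [1, 0, 0, 0, 0, 0]),
    ("avian",  [0, 1, 0, 0, 0, 0]),
    ("bovine", [0, 0, 1, 1, 0, 0]),
    ("bovine", [0, 0, 1, 0, 0, 0]),
    ("bovine", [0, 0, 0, 1, 0, 0]),
    ("human",  [0, 0, 0, 0, 1, 1]),
    ("human",  [0, 0, 0, 0, 1, 0]),
    ("human",  [0, 0, 0, 0, 0, 1]),
]


def hamming(p, q):
    """Number of proteins whose presence/absence differs between two profiles."""
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbour(train, profile):
    """Stand-in classifier 1: host of the closest training profile (Hamming)."""
    host, _ = min(train, key=lambda item: hamming(item[1], profile))
    return host


def nearest_centroid(train, profile):
    """Stand-in classifier 2: host whose mean profile is closest (squared Euclidean)."""
    by_host = {}
    for host, p in train:
        by_host.setdefault(host, []).append(p)
    best_host, best_dist = None, float("inf")
    for host, profiles in sorted(by_host.items()):
        centroid = [sum(col) / len(profiles) for col in zip(*profiles)]
        dist = sum((c - x) ** 2 for c, x in zip(centroid, profile))
        if dist < best_dist:
            best_host, best_dist = host, dist
    return best_host


def leave_one_out_accuracy(classifier, data):
    """Hold out each isolate in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (host, profile) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += classifier(train, profile) == host
    return correct / len(data)


CLASSIFIERS = {
    "nearest_neighbour": nearest_neighbour,
    "nearest_centroid": nearest_centroid,
}
results = {name: leave_one_out_accuracy(fn, DATA) for name, fn in CLASSIFIERS.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Both stand-ins classify this toy dataset perfectly because each host has its own marker proteins; that says nothing about performance on real isolates. The useful part is the shared evaluation loop, which lets any classifier with the same call signature be scored on identical held-out isolates.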
To conclude, accessory genomes were successfully used for extraction of host-associated
proteins as well as for prediction of source host and quantification of
zoonotic potential for bacterial species that can be isolated from multiple hosts.
The methods described here can be applied to other bacteria and overall have
implications for monitoring, identification and targeted interventions associated
with potentially zoonotic infections. The results depend entirely on
the quality of the dataset, which should be as large and diverse as possible. The research
highlights the predictive potential of such algorithms but also the need
for bacterial sequences to be gathered with as much useful metadata as possible,
including isolation host.