Developing bioinformatics approaches for the analysis of influenza virus whole genome sequence data
Influenza viruses represent a major public health burden worldwide, resulting in an estimated 500,000
deaths per year, with potential for devastating pandemics. Considerable effort is expended in the
surveillance of influenza, including major World Health Organization (WHO) initiatives such as the
Global Influenza Surveillance and Response System (GISRS). To this end, whole-genome sequencing
(WGS), and corresponding bioinformatics pipelines, have emerged as powerful tools. However,
due to the inherent diversity of influenza genomes, circulation in several different host species, and
noise in short-read data, several pitfalls can appear during bioinformatics processing and analysis.
2.1.2 Results
Conventional mapping approaches can be insufficient when a sub-optimal reference strain is chosen.
For short-read datasets simulated from human-origin influenza H1N1 HA sequences, read recovery
after single-reference mapping was routinely as low as 90% for human-origin influenza sequences,
and often lower than 10% for those from avian hosts. To address this, I developed software using de Bruijn
graphs (DBGs) for classification of influenza WGS datasets: VAPOR. In real data benchmarking
using 257 WGS read sets with corresponding de novo assemblies, VAPOR provided classifications
for all samples with a mean of >99.8% identity to assembled contigs. This increased the number of
mapped reads by 6.8% on average, up to a maximum of 13.3%. Additionally, using
simulations, I demonstrate that classification from reads may be applied to detection of reassorted
strains.
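The general idea of classifying a read set against candidate references through a de Bruijn graph can be illustrated with a minimal sketch. This is not the VAPOR implementation: the function names (`build_dbg`, `score_reference`, `classify`), the toy sequences, the small k, and the scoring rule (the fraction of a reference's k-mers present as edges in the read graph) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, upper-cased, skipping k-mers that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_dbg(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer adds one edge from its prefix to its
    suffix, and the edge weight counts how often that k-mer was seen.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def score_reference(graph, ref, k):
    """Return the fraction of the reference's k-mers found as edges in the graph."""
    total = 0
    hits = 0
    for kmer in kmers(ref, k):
        total += 1
        # .get() avoids creating empty entries in the defaultdict
        if kmer[1:] in graph.get(kmer[:-1], {}):
            hits += 1
    return hits / total if total else 0.0


def classify(reads, references, k=5):
    """Score every candidate reference and return (best_name, all_scores)."""
    graph = build_dbg(reads, k)
    scores = {name: score_reference(graph, seq, k)
              for name, seq in references.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy references (not real influenza sequences).
references = {
    "strain_A": "ATGAAGGCAATACTAGTAGTT",
    "strain_B": "ATGAAGACTATCATTGCTTTG",
}

# Simulate error-free 10 bp reads tiled across strain_A.
source = references["strain_A"]
reads = [source[i:i + 10] for i in range(len(source) - 9)]

best, scores = classify(reads, references, k=5)
print(best, scores)
```

In this toy case the reads are drawn from `strain_A`, so that reference shares far more k-mers with the read graph than `strain_B` does. A practical tool would additionally need to handle sequencing errors, reverse complements, and much larger reference collections, none of which this sketch attempts.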
2.1.3 Conclusions
The approach used in this study has the potential to simplify bioinformatics pipelines for surveillance,
providing a novel method for detection of influenza strains of human and non-human origin directly
from reads, minimizing potential data loss and bias associated with conventional mapping, and
facilitating alignments that would otherwise require slow de novo assembly. Whilst with expertise and
time these pitfalls can largely be avoided, with pre-classification they are remedied in a single step.
Furthermore, this algorithm could be adapted in future to surveillance of other RNA viruses. VAPOR
is available at https://github.com/connor-lab/vapor. Lastly, VAPOR could be improved by future
implementation in C++ and by more efficient methods for DBG representation.