    Genome sequence-based virus taxonomy using machine learning

    Virus taxonomy is the task of partitioning the world of viruses into a coherent scheme of easily recognisable entities, with the major purpose of answering the everyday needs of practising virologists. Traditional approaches involve a lengthy process, done case by case through proposals by experienced virologists. With rapid advances in sequencing technology generating large numbers of virus genome sequences at an ever increasing rate, genome sequences are often the only information available for a virus in many situations. Traditional approaches are unable to handle this tsunami of data and to incorporate the newly identified viruses into existing systems in a timely and efficient manner. Thus, automated methods for classifying viruses given only the primary structure of genomes are needed to aid the work of taxonomists. This thesis contributes to the application of machine learning techniques to genome sequence-based virus taxonomy. Specifically, we apply machine learning techniques to classify the NCBI reference sequences of virus model species into seven Baltimore Classes, four host groups or hundreds of ICTV hierarchical classes. We provide visualisations of a virus genome sequence dataset using various techniques and highlight properties of composition- and location-related nucleotide statistics, and statistics of the dataset as a whole. The thesis also provides a systematic experimental framework for applying machine learning techniques to virus taxonomy. Using the framework, we study the predictive power of various features of virus genome sequences and classifiers in multi-class classification, from simple single variable statistics to sophisticated high dimensional representations, from simple k-NN classifiers to more advanced SVM, RF and graph-based SSL methods. With optimised experimental factors, our results outperform the current state of the art.
In addition, we identify individual virus sequences that are frequently mislabelled by automated methods, study their memberships and provide predictions for currently unlabelled sequences using the best methods in our study. Finally, we extend the methods established in multi-class classification to the hierarchical classification problem of predicting ICTV taxonomic classes, which involves hundreds of classes, many of them having very few samples per class. We find that both hierarchical and SSL approaches can improve performance in the task of virus genome classification.