UCL Discovery

Genome sequence-based virus taxonomy using machine learning

Wang, T; (2017) Genome sequence-based virus taxonomy using machine learning. Doctoral thesis (Ph.D), UCL (University College London). Green open access

Abstract

Virus taxonomy is the task of partitioning the world of viruses into a coherent scheme of easily recognisable entities, with the major purpose of answering the everyday needs of practising virologists. Traditional approaches involve a lengthy process, done case by case through proposals by experienced virologists. With rapid advances in sequencing technology generating large numbers of virus genome sequences at an ever increasing rate, genome sequences are often the only information available for a virus. Traditional approaches are unable to handle this tsunami of data and to incorporate the newly identified viruses into existing systems in a timely and efficient manner.
Thus, automated methods for classifying viruses given only the primary structure of genomes are needed to aid the work of taxonomists. This thesis contributes to the application of machine learning techniques to genome sequence-based virus taxonomy. Specifically, we apply machine learning techniques to classify the NCBI reference sequences of virus model species into seven Baltimore Classes, four host groups or hundreds of ICTV hierarchical classes. We provide visualisations of a virus genome sequence dataset using various techniques and highlight properties of composition- and location-related nucleotide statistics, and statistics of the dataset as a whole. The thesis also provides a systematic experimental framework for applying machine learning techniques to virus taxonomy. Using the framework, we study the predictive power of various features of virus genome sequences and classifiers in multi-class classification, from simple single variable statistics to sophisticated high dimensional representations, from simple k-NN classifiers to more advanced SVM, RF and graph-based SSL methods. With optimised experimental factors, our results outperform the current state of the art.
In addition, we identify individual virus sequences that are frequently mislabelled by automated methods, study their memberships and provide predictions for currently unlabelled sequences using the best methods in our study. Finally, we extend the methods established in multi-class classification to the hierarchical classification problem of predicting ICTV taxonomic classes, which involves hundreds of classes, many of them having very few samples per class. We find that both hierarchical and SSL approaches can improve performance in the task of virus genome classification.
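
As a rough illustration of what composition-based features combined with a simple k-NN classifier can look like, the sketch below computes normalised k-mer frequencies from a nucleotide string and assigns the label of the nearest training profile. It is not taken from the thesis: the function names `kmer_profile` and `knn_predict`, the choice of dinucleotides (k = 2), the Euclidean distance, and the toy sequences and labels are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=2):
    """Return the normalised k-mer frequency vector of seq over A, C, G, T.

    Hypothetical helper for illustration; k-mers containing other symbols
    (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def knn_predict(train, query, n_neighbors=1):
    """Majority vote among the n_neighbors closest training vectors.

    train is a list of (vector, label) pairs; distance is Euclidean.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy, made-up sequences and labels (not real virus genomes or taxa).
train = [
    (kmer_profile("ATATATTTAATAAT"), "toy_AT_rich"),
    (kmer_profile("GCGCGGCCGCGGCC"), "toy_GC_rich"),
]
query = kmer_profile("ATTATATAAATT")
predicted = knn_predict(train, query)
print(predicted)
```

In practice one would replace the toy strings with labelled reference genome sequences and compare several values of k and several classifiers, which is the kind of systematic comparison the abstract describes.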

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Genome sequence-based virus taxonomy using machine learning
Event: University College London
Open access status: An open access version is available from UCL Discovery
Language: English
UCL classification: UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10027741