TY - UNPB
EP - 155
Y1 - 2007/01/31/
AV - public
TI - Support vector machines for drug discovery
N1 - Thesis digitised by ProQuest. Third party copyright material has been removed from the ethesis. Images identifying individuals have been redacted or partially redacted to protect their identity.
PB - UCL (University College London)
UR - https://discovery.ucl.ac.uk/id/eprint/1445885/
ID - discovery1445885
N2 - Support vector machines (SVMs) have displayed good predictive accuracy on a wide range of classification tasks and are inherently adaptable to complex problem domains. Structure-property correlation (SPC) analysis is a vital part of the contemporary drug discovery process, in which several components of the search for novel molecular compounds with therapeutic potential may be performed by computer (in silico). Inferred relationships between molecular structure and biological properties of interest are used to eliminate compounds unsuitable for further development. In order to improve process efficiency without rejecting useful compounds, predictive accuracy of such relationships must remain high despite a paucity of data from which to infer them. This thesis describes the application of SVMs to SPC analysis and investigates methods with which to enhance performance and facilitate integration of the technique into present practice. Overviews of contemporary drug discovery and the role of machine learning place the investigation into context. Computational discrimination between compounds according to their structures and properties of interest is described in detail, as is the SVM algorithm. A framework for the assessment of supervised machine learning performance on SPC data is proposed and employed to assess SVM performance alongside state-of-the-art techniques for in silico SPC analysis on data provided by GlaxoSmithKline. SVM performance is competitive and the comparison prompts adaptations of both data treatment and algorithmic application to explore the effects of data paucity, class imbalance and outlying data. Subsequent work weights the SVM kernel matrix to recognise heavily populated regions of training data and suggests the incorporation of domain-specific clustering methods to assist the standard SVM algorithm.
The notion that SVM kernel functions may incorporate existing domain-specific methods leads to kernel functions that employ existing pharmaceutical similarity measures to treat an abstract, binary representation of molecular structure that is not used widely for SPC analysis.
A1 - Trotter, MWB
M1 - Doctoral
ER -