Human protein function prediction: application of machine learning for integration of heterogeneous data sources.
Doctoral thesis, UCL (University College London).
Experimental characterisation of protein cellular function can be prohibitively expensive and take years to complete. To address this problem, this thesis focuses on the development of computational approaches to predict function from sequence. For sequences with well characterised close relatives, annotation is trivial, orphans or distant homologues present a greater challenge. The use of a feature based method employing ensemble support vector machines to predict individual Gene Ontology classes is investigated. It is found that different combinations of feature inputs are required to recognise different functions. Although the approach is applicable to any human protein sequence, it is restricted to broadly descriptive functions. The method is well suited to prioritisation of candidate functions for novel proteins rather than to make highly accurate class assignments. Signatures of common function can be derived from different biological characteristics; interactions and binding events as well as expression behaviour. To investigate the hypothesis that common function can be derived from expression information, public domain human microarray datasets are assembled. The questions of how best to integrate these datasets and derive features that are useful in function prediction are addressed. Both co-expression and abundance information is represented between and within experiments and investigated for correlation with function. It is found that features derived from expression data serve as a weak but significant signal for recognising functions. This signal is stronger for biological processes than molecular function categories and independent of homology information. The protein domain has historically been coined as a modular evolutionary unit of protein function. The occurrence of domains that can be linked by ancestral fusion events serves as a signal for domain-domain interactions. To exploit this information for function prediction, novel domain architecture and fused architecture scores are developed. Architecture scores rather than single domain scores correlate more strongly with function, and both architecture and fusion scores correlate more strongly with molecular functions than biological processes. The final study details the development of a novel heterogeneous function prediction approach designed to target the annotation of both homologous and non-homologous proteins. Support vector regression is used to combine pair-wise sequence features with expression scores and domain architecture scores to rank protein pairs in terms of their functional similarities. The target of the regression models represents the continuum of protein function space empirically derived from the Gene Ontology molecular function and biological process graphs. The merit and performance of the approach is demonstrated using homologous and non-homologous test datasets and significantly improves upon classical nearest neighbour annotation transfer by sequence methods. The final model represents a method that achieves a compromise between high specificity and sensitivity for all human proteins regardless of their homology status. It is expected that this strategy will allow for more comprehensive and accurate annotations of the human proteome.
|Title:||Human protein function prediction: application of machine learning for integration of heterogeneous data sources|
|Open access status:||An open access version is available from UCL Discovery|
|UCL classification:||UCL > School of Life and Medical Sciences > Faculty of Life Sciences > Biosciences (Division of) > Structural and Molecular Biology|
Archive Staff Only