UCL Discovery

Learning a Functional Grammar of Protein Domains using Natural Language Word Embedding Techniques

Buchan, DW; Jones, DT (2020) Learning a Functional Grammar of Protein Domains using Natural Language Word Embedding Techniques. Proteins, 88 (4) pp. 616-624. 10.1002/prot.25842. Green open access

Abstract

In this paper, using word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" where domain identifiers are tokens which may be considered as "words". Using all InterPro [1] Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam [2] Domains of Unknown Function.
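
The "proteins as sentences, domains as words" framing can be illustrated with a minimal sketch. This is not the authors' pipeline and it does not train word2vec: it uses a simple count-based stand-in (co-occurrence vectors compared by cosine similarity) to show the distributional idea that domains appearing in similar contexts receive similar vectors. The toy architectures, the window size, and the function names `skipgram_pairs`, `context_counts`, and `cosine` are all assumptions made for illustration; the identifiers are treated as opaque tokens.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: each protein is a "sentence" of Pfam-style domain
# identifiers in N- to C-terminal order. These architectures are invented
# for illustration and are not taken from InterPro.
proteins = [
    ["PF00069", "PF00017", "PF00018"],
    ["PF00069", "PF00017"],
    ["PF07714", "PF00017", "PF00018"],
    ["PF00001", "PF00002"],
]


def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs within `window` positions of each token."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


def context_counts(corpus, window=2):
    """Map each domain to a Counter of the domains it co-occurs with."""
    counts = {}
    for sentence in corpus:
        for center, context in skipgram_pairs(sentence, window):
            counts.setdefault(center, Counter())[context] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


counts = context_counts(proteins)

# Two domains that share neighbouring domains end up close together...
sim_shared_context = cosine(counts["PF00069"], counts["PF07714"])
# ...while domains with disjoint contexts have zero similarity.
sim_disjoint = cosine(counts["PF00069"], counts["PF00001"])
```

The (center, context) pairs produced by `skipgram_pairs` are the same kind of training examples a skip-gram word2vec model consumes; word2vec would learn dense fixed-dimension vectors from them rather than the sparse counts used here. A domain of unknown function whose vector lies near annotated domains could then be assigned their GO terms as putative annotations, which is the use the abstract suggests.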

Type: Article
Title: Learning a Functional Grammar of Protein Domains using Natural Language Word Embedding Techniques
Location: United States
Open access status: An open access version is available from UCL Discovery
DOI: 10.1002/prot.25842
Publisher version: https://doi.org/10.1002/prot.25842
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Semantic embedding, function prediction, machine learning, protein domains, word2vec
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10086769