Clustering FunFams using sequence embeddings improves EC purity

Littmann, M; Bordin, N; Heinzinger, M; Schütze, K; Dallago, C; Orengo, C; Rost, B; (2021) Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics, 37 (20) pp. 3449-3455. DOI: 10.1093/bioinformatics/btab371. Green open access

Bordin_Clustering FunFams using sequence embeddings improves EC purity_VoR.pdf - Published Version

Abstract

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and allows annotations to be transferred within a family. For this, functional families need to be "pure", i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. Of all FunFams, 11% (22,830 of 203,639) contain EC annotations, and of those, 7% (1,526 of 22,830) have inconsistent functional annotations.

RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models that transfer knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect increased purity for other aspects of function as well. Our results can help in generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

AVAILABILITY: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

SUPPLEMENTARY INFORMATION: Supplementary data are available online.
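
To illustrate the idea of clustering a family by embedding distance and then checking cluster purity, the following is a minimal sketch only, not the authors' implementation (which is in the GitHub repository named above). Everything concrete in it is an assumption made for illustration: the 2-D toy vectors standing in for real embeddings, the Euclidean distance, the `eps` and `min_samples` values, the example EC numbers, and the helper names `dbscan` and `pure_clusters`. The DBSCAN routine is a plain textbook version written with the Python standard library.

```python
from math import dist

NOISE = -1  # label used for outliers


def dbscan(points, eps, min_samples):
    """Textbook DBSCAN: return a cluster id (0, 1, ...) or NOISE per point."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # provisional; may become a border point later
            continue
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster_id += 1
    return labels


def pure_clusters(labels, ec_numbers):
    """Map each cluster id to True if all its members share one EC number."""
    groups = {}
    for label, ec in zip(labels, ec_numbers):
        if label != NOISE:
            groups.setdefault(label, set()).add(ec)
    return {label: len(ecs) == 1 for label, ecs in groups.items()}


# Hypothetical toy "family": two tight groups plus one distant outlier.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
ec_numbers = ["1.1.1.1"] * 3 + ["2.7.1.1"] * 3 + ["3.1.1.1"]

labels = dbscan(embeddings, eps=0.5, min_samples=2)
purity = pure_clusters(labels, ec_numbers)
print(labels)
print(purity)
```

In this toy case the mixed family splits into two clusters, each with a single EC number, and the distant protein is flagged as an outlier instead of being forced into a cluster. Real embeddings are high-dimensional, and the choice of distance measure and DBSCAN parameters would need to follow the paper and its repository.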

Type: Article
Title: Clustering FunFams using sequence embeddings improves EC purity
Location: England
Open access status: An open access version is available from UCL Discovery
DOI: 10.1093/bioinformatics/btab371
Publisher version: https://doi.org/10.1093/bioinformatics/btab371
Language: English
Additional information: © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).
Keywords: Functional families, protein function, CATH, EC numbers, unsupervised learning, contrastive learning, word embeddings, transfer learning
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences > Structural and Molecular Biology
URI: https://discovery.ucl.ac.uk/id/eprint/10127993