UCL Discovery

Contrastive learning on protein embeddings enlightens midnight zone

Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard (2022) Contrastive learning on protein embeddings enlightens midnight zone. NAR Genomics and Bioinformatics, 4 (2), Article lqac043. 10.1093/nargab/lqac043. Green open access

Contrastive learning on protein embeddings enlightens midnight zone.pdf - Published Version


Abstract

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques, which achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
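
The two ideas named in the abstract, contrastive learning on embeddings and embedding-based annotation transfer, can be illustrated with a minimal sketch. This is not the code from the Rostlab/EAT repository: the abstract does not state which loss or distance is used, so the triplet margin loss, the Euclidean distance, the margin value, the function names, and the toy 2-dimensional vectors and labels below are all assumptions chosen for illustration.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    Assumption: one common form of contrastive objective; the margin value
    is arbitrary. The loss is zero once each anchor is closer to its
    positive than to its negative by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()


def transfer_annotation(query, lookup_embeddings, lookup_labels):
    """Return the label and distance of the lookup embedding nearest to the query.

    Assumption: nearest-neighbour lookup with Euclidean distance stands in
    for embedding-based annotation transfer (EAT).
    """
    distances = np.linalg.norm(lookup_embeddings - query, axis=1)
    nearest = int(np.argmin(distances))
    return lookup_labels[nearest], float(distances[nearest])


# Toy 2-dimensional "embeddings" with placeholder CATH-style labels.
lookup_embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
lookup_labels = ["1.10.8.10", "1.10.8.10", "3.40.50.300"]

label, distance = transfer_annotation(
    np.array([4.0, 5.0]), lookup_embeddings, lookup_labels
)
print(label, distance)
```

In this sketch, the loss would be minimized while training a projection of the pLM embeddings, and the lookup would then be run in the projected space. No alignment is computed at lookup time, only vector distances, which is consistent with the speed advantage the abstract describes.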

Type: Article
Title: Contrastive learning on protein embeddings enlightens midnight zone
Location: England
Open access status: An open access version is available from UCL Discovery
DOI: 10.1093/nargab/lqac043
Publisher version: https://doi.org/10.1093/nargab/lqac043
Language: English
Additional information: © The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
UCL classification: UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences > Structural and Molecular Biology
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences
URI: https://discovery.ucl.ac.uk/id/eprint/10150829