UCL Discovery

OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Rossier, V; Vesztrocy, AW; Robinson-Rechavi, M; Dessimoz, C; (2021) OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches. Bioinformatics 10.1093/bioinformatics/btab219. (In press). Green open access

Desssimoz_OMAmer- tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches_AOP.pdf - Published Version


Abstract

MOTIVATION: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive.

RESULTS: Here, we first show that in multiple animal and plant datasets, 18 to 62% of closest-sequence assignments are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.

AVAILABILITY: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
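
To make the alignment-free idea in the abstract more concrete, the following is a minimal toy sketch of k-mer-based family assignment with rejection of unrelated queries. It is not OMAmer's algorithm or API: it ignores the evolutionary (tree-driven) placement into ancestral subfamilies that the abstract describes, and every name in it (`kmers`, `build_index`, `assign`, `FAMILIES`, `min_fraction`), as well as the sequences and the threshold value, is invented here for illustration.

```python
from collections import Counter, defaultdict

K = 3  # k-mer length; an arbitrary choice for this toy example

# Made-up sequences grouped under made-up family labels (illustrative only).
FAMILIES = {
    "famA": ["MKTAYIAKQR", "MKTAYIAKQK"],
    "famB": ["GSHMLEDPVD", "GSHMLEDPAD"],
}


def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families, k):
    """Map each k-mer to the set of family labels whose members contain it."""
    index = defaultdict(set)
    for label, seqs in families.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                index[kmer].add(label)
    return index


def assign(query, index, k, min_fraction=0.3):
    """Return the family sharing the most k-mers with the query.

    Returns None when too few of the query's k-mers are found in any
    family, which stands in for rejecting a non-homologous query.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return None
    votes = Counter()
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            votes[label] += 1
    if not votes:
        return None
    label, hits = votes.most_common(1)[0]
    return label if hits / len(query_kmers) >= min_fraction else None


index = build_index(FAMILIES, K)
print(assign("MKTAYIAKQL", index, K))  # query resembling famA
print(assign("WWWWWWWWWW", index, K))  # query sharing no k-mers
```

No alignment is computed at any point: the query is decomposed into k-mers and looked up in a precomputed table, which is why approaches of this kind can be fast. A flat vote like this one still cannot tell a subfamily from its parent family; addressing that is the role of the evolutionarily-informed k-mers described in the abstract.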

Type: Article
Title: OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.
Location: England
Open access status: An open access version is available from UCL Discovery
DOI: 10.1093/bioinformatics/btab219
Publisher version: https://doi.org/10.1093/bioinformatics/btab219
Language: English
Additional information: © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences > Genetics, Evolution and Environment
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Dept of Mathematics
URI: https://discovery.ucl.ac.uk/id/eprint/10127163
Downloads since deposit: 66
