UCL Discovery
UCL home » Library Services » Electronic resources » UCL Discovery

Parallel and scalable precise clustering

Byma, S; Dhasade, A; Altenhoff, A; Dessimoz, C; Larus, JR; (2020) Parallel and scalable precise clustering. In: PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. (pp. pp. 217-228). ACM: Virtual event. Green open access

[thumbnail of Dessimoz_751214v1.full.pdf]
Preview
Text
Dessimoz_751214v1.full.pdf - Accepted Version

Download (900kB) | Preview

Abstract

This paper describes a new technique for parallelizing protein clustering, an important bioinformatics computation for the analysis of protein sequences. Protein clustering identifies groups of proteins that are similar because they share long sequences of similar amino acids. Given a collection of protein sequences, clustering can significantly reduce the computational effort required to identify all similar sequences by avoiding many negative comparisons. The challenge, however, is to build a clustering that misses as few similar sequences (or elements, more generally) as possible. In this paper, we introduce precise clustering, a property that requires each pair of similar elements to appear together in at least one cluster. We show that transitivity in the data can be leveraged to merge clusters while maintaining a precise clustering, providing a basis for independently forming clusters. This allows us reformulate clustering as a bottom-up merge of independent clusters in a new algorithm called ClusterMerge. ClusterMerge exposes parallelism, enabling fast and scalable implementations. We apply ClusterMerge to find similar amino acid sequences in a collection o comparison, with only half as many comparisons. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604 times on 768 cores (1400 times faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.

Type: Proceedings paper
Title: Parallel and scalable precise clustering
Event: PACT '20: International Conference on Parallel Architectures and Compilation Techniques
Open access status: An open access version is available from UCL Discovery
DOI: 10.1145/3410463.3414646
Publisher version: https://doi.org/10.1145/3410463.3414646
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Div of Biosciences > Genetics, Evolution and Environment
URI: https://discovery.ucl.ac.uk/id/eprint/10117937
Downloads since deposit
61Downloads
Download activity - last month
Download activity - last 12 months
Downloads by country - last 12 months

Archive Staff Only

View Item View Item