UCL Discovery

K-mer based prediction of Clostridioides difficile relatedness and ribotypes

Moore, Matthew P; Wilcox, Mark H; Walker, Ann; Eyre, David; (2022) K-mer based prediction of Clostridioides difficile relatedness and ribotypes. Microbial Genomics, 8, Article 000804. 10.1099/mgen.0.000804. Green open access

Abstract

Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and using the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences (SNPs) between genomes and C. difficile ribotypes (RTs). We used sourmash to screen a set of 1905 diverse C. difficile genomes (differing by 0–168 519 SNPs) for closely related pairs. At a sensitivity of 100 % for pairs ≤10 SNPs, sourmash reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value (PPV) of 32 % for correctly identifying pairs ≤10 SNPs (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were randomly split into a training set (n=2937) and a test set (n=1000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index shared the same ribotype, this was taken as the predicted ribotype of the searched genome. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %) and indeterminate in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %. Using MinHash it is possible to subsample C. difficile genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
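
The two ideas in the abstract, MinHash similarity estimation and the closest-five ribotype vote, can be illustrated with a minimal pure-Python sketch. This is not sourmash's API or the authors' code: the function names, the BLAKE2b hash, the k-mer length of 21 and the bottom-k sketch variant are all assumptions made for illustration, and reverse-complement (canonical) k-mer handling is omitted for brevity. Only the sketch size of 2000 and the 5/5 and 4/5 agreement rules come from the abstract.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (the hash function is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=2000):
    """Bottom-k MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=2000):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    # Take the `size` smallest hashes of the combined sketches, then count
    # how many of them occur in both sketches.
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def predict_ribotype(query, index, n_closest=5, min_agree=5, size=2000):
    """Predict a ribotype from the closest sketches in a labelled index.

    `index` is a list of (sketch, ribotype) pairs (hypothetical structure).
    Returns the ribotype if at least `min_agree` of the `n_closest` most
    similar entries share it, otherwise None (indeterminate).
    """
    ranked = sorted(
        index,
        key=lambda entry: jaccard_estimate(query, entry[0], size),
        reverse=True,
    )
    top = [ribotype for _, ribotype in ranked[:n_closest]]
    if len(top) < n_closest:
        return None
    ribotype, count = Counter(top).most_common(1)[0]
    return ribotype if count >= min_agree else None
```

Calling `predict_ribotype(..., min_agree=5)` corresponds to the strict rule described above, and `min_agree=4` to the relaxed 4/5 rule. A sorted linear scan is used here only for clarity; a real index would avoid comparing the query against every stored sketch.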

Type: Article
Title: K-mer based prediction of Clostridioides difficile relatedness and ribotypes
Open access status: An open access version is available from UCL Discovery
DOI: 10.1099/mgen.0.000804
Publisher version: https://doi.org/10.1099/mgen.0.000804
Language: English
Additional information: © 2022 The Authors. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/deed.ast).
Keywords: Clostridioides difficile, k-mer, MASH, ribotype
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Inst of Clinical Trials and Methodology > MRC Clinical Trials Unit at UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Inst of Clinical Trials and Methodology
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
URI: https://discovery.ucl.ac.uk/id/eprint/10144638