UCL Discovery
UCL home » Library Services » Electronic resources » UCL Discovery

Efficient Graph Construction for Similarity Search on High Dimensional Data

Kanthan, Leslie; (2019) Efficient Graph Construction for Similarity Search on High Dimensional Data. Doctoral thesis (Ph.D), UCL (University College London). Green open access

[thumbnail of Kanthan_10080644_thesis.pdf]
Preview
Text
Kanthan_10080644_thesis.pdf

Download (9MB) | Preview

Abstract

The K nearest neighbours graph, denoted KNNG, is an essential graph in data mining and machine learning. However, despite its vital significance, exact construction of this graph for high dimensional datasets (d 10) is inefficient (O(n2) computational complexity). Approximate algorithms have been shown to improve upon this complexity, but compromise accuracy. In this thesis, we focus on automatically improving existing locality sensitive hashing schemes and proposing new schemes that find good trade-offs between accuracy and speed. We investigate how to obtain an LSH version with a guaranteed worst-case subquadratic cost that minimises the loss of accuracy. We implement such an algorithm and evaluate its runtime impact for different types of datasets. We implement the most popular versions and perform a detailed experimental comparison and present trends between specific LSH versions and the input dataset characteristics. Relying on the findings of this analysis, we propose Variable Radius LSH (VRLSH), a new LSH scheme that is suitable for distributed computation and capable of handling large datasets. We show how VRLSH can scale efficiently with the size of the dataset, and how it can improve the accuracy of the generated KNNG. Next, we propose three new LSH schemes that rely on the strategy of imitating biological systems. In particular, we propose RFLY, PFLY and DPFLY three schemes inspired by FLY-LSH, a recent variation of the LSH algorithm that relies on the olfactory circuit of flies, used to identify similar odours. We first experiment and expand FLY-LSH by running it on a larger number of datasets. The three proposed algorithms improve both the accuracy and the applicability of FLY-LSH on real datasets. Firstly, RFLY improves the accuracy of the generated graph by 10%. Then PFLY distributes data more appropriately in a pre-fixed number of buckets, while concurrently improving the accuracy of the generated graph. Thirdly, DPFLY adapts random projects to the input dataset, achieving 15% improvement. Hitherto, we propose a novel optimisation framework that uses machine learning techniques and genetic algorithms to automatically select a pareto frontier tuned version of the LSH schemes for a given specific input dataset. In our experiments, our optimisation framework improves the performance (both speed and accuracy) for every version of the LSH algorithm by 10% and 13% respectively. Last, we discuss future work and how the findings of this thesis can further help the research community.

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Efficient Graph Construction for Similarity Search on High Dimensional Data
Event: UCL (University College London)
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Copyright © The Author 2019. Original content in this thesis is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10080644
Downloads since deposit
328Downloads
Download activity - last month
Download activity - last 12 months
Downloads by country - last 12 months

Archive Staff Only

View Item View Item