Recovering the number of clusters in data sets with noise features using feature rescaling factors

de Amorim, RC; Hennig, C (2015) Recovering the number of clusters in data sets with noise features using feature rescaling factors. Information Sciences, 324, pp. 126-145. doi:10.1016/j.ins.2015.06.039. Green open access

Henning_1475072_amorimhennigapr15R1fordiscovery.pdf - Accepted Version

Abstract

In this paper we introduce three methods for re-scaling data sets, aiming to increase the likelihood that cluster validity indexes return the true number of spherical Gaussian clusters in data sets with additional noise features. Our methods obtain feature re-scaling factors that take into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance in different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the pth power of the Minkowski distance), Dunn's, Calinski–Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters, with and without noise features. We conclude that our methods do increase the chances of estimating the true number of clusters in a data set.
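
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' method, of how re-scaling a noise feature can change the Silhouette index computed with the pth power of the Minkowski distance. The function names (`minkowski_p`, `rescale`, `silhouette`), the toy data, and the hand-picked factors `(1.0, 0.1)` are all assumptions made for illustration. In particular, the paper derives its factors from the data and allows them to differ between clusters, whereas this sketch uses one fixed factor per feature.

```python
from typing import Dict, List, Sequence


def minkowski_p(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """pth power of the Minkowski distance (p=2: squared Euclidean, p=1: Manhattan)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y))


def rescale(points: Sequence[Sequence[float]],
            factors: Sequence[float]) -> List[List[float]]:
    """Multiply every feature by its factor (hypothetical, hand-picked factors)."""
    return [[f * v for f, v in zip(factors, row)] for row in points]


def silhouette(points: Sequence[Sequence[float]],
               labels: Sequence[int], p: float = 2.0) -> float:
    """Average silhouette width of a labelled data set."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the point's own cluster
        a = sum(minkowski_p(points[i], points[j], p)
                for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(minkowski_p(points[i], points[j], p) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data: feature 0 separates two groups, feature 1 is noise.
points = [(0.0, 5.0), (0.2, -4.0), (0.1, 3.0),
          (5.0, -5.0), (5.2, 4.0), (5.1, -3.0)]
labels = [0, 0, 0, 1, 1, 1]

raw_score = silhouette(points, labels, p=2.0)
rescaled_score = silhouette(rescale(points, (1.0, 0.1)), labels, p=2.0)
print("raw:", raw_score, "rescaled:", rescaled_score)
```

On this toy data, shrinking the noise feature raises the silhouette of the correct two-cluster labelling, which is the effect the re-scaling factors are meant to produce. In practice the index would be compared across candidate partitions with different numbers of clusters, and the partition with the best value would indicate the estimated number.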

Type: Article
Title: Recovering the number of clusters in data sets with noise features using feature rescaling factors
Open access status: An open access version is available from UCL Discovery
DOI: 10.1016/j.ins.2015.06.039
Publisher version: http://dx.doi.org/10.1016/j.ins.2015.06.039
Language: English
Additional information: © 2015 Elsevier Inc. All rights reserved. This manuscript is made available under a Creative Commons Attribution Non-commercial Non-derivative 4.0 International license (CC BY-NC-ND 4.0). This license allows you to share, copy, distribute and transmit the work for personal and non-commercial use, provided that author and publisher attribution is clearly stated. Further details about CC BY licenses are available at http://creativecommons.org/licenses/by/4.0. Access may be initially restricted by the publisher.
Keywords: Feature re-scaling, Clustering, K-Means, Cluster validity index, Feature weighting
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Dept of Statistical Science
URI: https://discovery.ucl.ac.uk/id/eprint/1475072