Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

Huang, J; Keung, JW; Sarro, F; Li, Y-F; Yu, YT; Chan, WK; Sun, H (2017) Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study. Journal of Systems and Software, 132, pp. 226-252. 10.1016/j.jss.2017.07.012. Green open access.

Abstract

Predicting software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often used together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in these datasets reduce estimation accuracy and can therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation has gradually gained acceptance in empirical studies owing to its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which uses a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is generally superior to the others. The relatively optimal fixed parameter settings for KNN imputation of software quality data are also determined. We observe that classification accuracy is improved, or at least maintained, when our approach is used for missing data imputation.
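
The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea (choosing k separately for each missing value by cross-validation), not the authors' method. Everything in it is an assumption made for illustration: the function names `impute_cv_knn`, `_distances` and `_knn_mean`, the candidate values of k, the use of leave-one-out validation over donor rows, the distance measure (root mean squared difference over features observed in both rows), and the use of an unweighted mean of the neighbors' values.

```python
import numpy as np


def _distances(X, target, candidates, features):
    """Root mean squared difference over features observed in both rows."""
    diffs = X[candidates][:, features] - X[target, features]
    observed = ~np.isnan(diffs)
    counts = observed.sum(axis=1)
    squared = np.where(observed, diffs, 0.0) ** 2
    with np.errstate(invalid="ignore", divide="ignore"):
        dist = np.sqrt(squared.sum(axis=1) / counts)
    dist[counts == 0] = np.inf  # no shared observed feature: treat as farthest
    return dist


def _knn_mean(X, target, donors, features, j, k):
    """Mean of feature j over the k donors nearest to the target row."""
    dist = _distances(X, target, donors, features)
    nearest = donors[np.argsort(dist, kind="stable")[:k]]
    return X[nearest, j].mean()


def impute_cv_knn(X, k_candidates=(1, 3, 5)):
    """Fill each NaN using a k chosen for that cell by leave-one-out validation.

    Hypothetical sketch: donors are all rows in which feature j is observed,
    and distances use only the features observed in the incomplete row.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        features = np.where(~np.isnan(X[i]))[0]  # feature j is NaN, so excluded
        donors = np.where(~np.isnan(X[:, j]))[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the value missing
        best_k, best_err = None, np.inf
        for k in k_candidates:
            if k >= donors.size:
                continue  # leave-one-out needs k other donors
            # Hide each donor's known value in turn and try to recover it.
            errors = [
                abs(X[d, j] - _knn_mean(X, d, donors[donors != d], features, j, k))
                for d in donors
            ]
            err = float(np.mean(errors))
            if err < best_err:
                best_k, best_err = k, err
        if best_k is None:  # too few donors to validate any candidate
            best_k = min(min(k_candidates), donors.size)
        filled[i, j] = _knn_mean(X, i, donors, features, j, best_k)
    return filled
```

Calling `impute_cv_knn(X)` on a two-dimensional array with `NaN` entries returns a copy in which each missing cell has been filled from the observed values only. Validation cost grows with the number of donors and candidate values of k, so this sketch suits small datasets.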

Type:	Article
Title:	Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study
Open access status:	An open access version is available from UCL Discovery
DOI:	10.1016/j.jss.2017.07.012
Publisher version:	https://doi.org/10.1016/j.jss.2017.07.012
Language:	English
Additional information:	This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords:	Empirical software engineering estimation, KNN, Imputation, Cross-validation, Missing data
UCL classification:	UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI:	https://discovery.ucl.ac.uk/id/eprint/10043051
