UCL Discovery

On the Use of Evaluation Measures for Defect Prediction Studies

Moussa, Rebecca; Sarro, Federica; (2022) On the Use of Evaluation Measures for Defect Prediction Studies. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA'22). Association for Computing Machinery (ACM) (In press). Green open access

MoussaISSTA22.pdf - Accepted Version. Download (1MB)

Abstract

Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures which are prone to bias when used to assess models, especially models trained on imbalanced data. We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures, based, for the first time, on both statistical significance tests and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied, according to the Wilcoxon statistical significance test and the Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% and 92% on average) for each of the measures investigated. This signifies that, in the majority of cases, a prediction technique believed to be better than others under a given evaluation measure becomes worse under a different one. We conclude by providing some recommendations for selecting appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC.
This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from those they were originally intended for. Besides, we recommend reporting, whenever possible, the raw confusion matrix, so that other researchers can compute any measure of interest, thereby making it feasible to draw meaningful observations across different studies.
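The abstract's closing recommendation rests on the fact that any binary evaluation measure, including MCC, can be recomputed from the four cells of a raw confusion matrix. As a minimal illustrative sketch (the function name and the example counts below are hypothetical, not taken from the paper), MCC can be derived as follows:

```python
import math

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from the four confusion-matrix cells.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    ranging from -1 (total disagreement) to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any row/column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced result: 30 defective vs. 70 clean modules.
score = mcc(tp=20, fp=5, fn=10, tn=65)
```

Because MCC uses all four cells, it stays informative on imbalanced data where a single measure such as accuracy or recall can look deceptively good; reporting the raw matrix lets readers recompute MCC or any other measure of interest.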

Type: Proceedings paper
Title: On the Use of Evaluation Measures for Defect Prediction Studies
Event: The 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA'22)
Location: Daejeon, South Korea
Dates: 18th-22nd July 2022
Open access status: An open access version is available from UCL Discovery
DOI: 10.1145/3533767.3534405
Publisher version: https://conf.researchr.org/home/issta-2022
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions.
Keywords: Software Defect Prediction, Evaluation Measures
UCL classification: UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10149866
