ISSTA 2022
Mon 18 - Fri 22 July 2022 Online

Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance.

Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot capture all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures that are prone to bias when used to assess models, especially those trained on imbalanced data.
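As a minimal illustration of this bias (the data below are hypothetical and not taken from the study), a trivial classifier that labels every module as non-defective can score highly on a single binary measure such as accuracy while having no predictive value, which a measure like MCC makes visible:

    # Hypothetical, imbalanced test set: 95 non-defective (0) and 5 defective (1) modules.
    from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100  # "always non-defective" baseline classifier

    print(accuracy_score(y_true, y_pred))     # 0.95 -- looks strong
    print(f1_score(y_true, y_pred))           # 0.0  -- no defects found
    print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- no correlation with the truth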

We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance tests and effect size analysis. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied, according to the Wilcoxon statistical significance test and the A12 effect size, respectively. Further, we observe a very high rank disruption (between 61% and 90% on average) for each of the measures investigated. This means that, in the majority of cases, a prediction technique believed to be better than others according to one evaluation measure becomes worse when assessed with a different one.
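For readers unfamiliar with these analyses, the sketch below shows how the scores of two models can be compared with the Wilcoxon signed-rank test and the Vargha-Delaney A12 effect size; the per-fold score values and the cross-validation setup are illustrative assumptions, not data from the paper.

    # Hypothetical MCC values of two classifiers over ten cross-validation folds.
    from scipy.stats import wilcoxon

    def a12(scores_a, scores_b):
        # Vargha-Delaney A12: probability that a random score from A exceeds one from B.
        greater = sum(1 for a in scores_a for b in scores_b if a > b)
        equal = sum(1 for a in scores_a for b in scores_b if a == b)
        return (greater + 0.5 * equal) / (len(scores_a) * len(scores_b))

    model_a = [0.41, 0.38, 0.45, 0.40, 0.36, 0.44, 0.39, 0.42, 0.37, 0.43]
    model_b = [0.33, 0.35, 0.30, 0.36, 0.31, 0.34, 0.32, 0.29, 0.35, 0.30]

    stat, p_value = wilcoxon(model_a, model_b)  # paired, per-fold comparison
    print(f"Wilcoxon p-value: {p_value:.4f}, A12: {a12(model_a, model_b):.2f}")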

We conclude by providing recommendations for the selection of appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC. This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from the ones they were originally intended for. In addition, we recommend reporting, whenever possible, the raw confusion matrix, to allow other researchers to compute any measure of interest, thereby making it feasible to draw meaningful observations across different studies.
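As a brief sketch of why reporting the raw confusion matrix is useful (the counts below are hypothetical), any measure of interest, including MCC, can be recomputed directly from the four cells:

    # Hypothetical confusion-matrix counts for a defect prediction model.
    from math import sqrt

    tp, fp, fn, tn = 30, 10, 20, 140

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} mcc={mcc:.2f}")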

Wed 20 Jul

Displayed time zone: Seoul

01:20 - 02:40
Session 1-1: Oracles, Models, and Measurement D
Technical Papers at ISSTA 1
01:20
20m
Talk
Combining Solution Reuse and Bound Tightening for Efficient Analysis of Evolving Systems (ACM SIGSOFT Distinguished Paper)
Technical Papers
Clay Stevens University of Nebraska-Lincoln, Hamid Bagheri University of Nebraska-Lincoln
DOI
01:40
20m
Talk
Evolution-Aware Detection of Order-Dependent Flaky Tests
Technical Papers
Chengpeng Li University of Texas at Austin, August Shi University of Texas at Austin
DOI
02:00
20m
Talk
jTrans: Jump-Aware Transformer for Binary Code Similarity Detection
Technical Papers
Hao Wang Tsinghua University, Wenjie Qu Huazhong University of Science and Technology, Gilad Katz Ben-Gurion University of the Negev, Wenyu Zhu Tsinghua University, Zeyu Gao University of Science and Technology of China, Han Qiu Tsinghua University, Jianwei Zhuge Tsinghua University, Chao Zhang Tsinghua University
DOI Pre-print
02:20
20m
Talk
On the Use of Evaluation Measures for Defect Prediction Studies
Technical Papers
Rebecca Moussa University College London, Federica Sarro University College London
DOI
07:00 - 08:20
Session 2-1: Oracles, Models, and Measurement E
Technical Papers at ISSTA 1
Chair(s): Christoph Csallner University of Texas at Arlington
07:00
20m
Talk
On the Use of Evaluation Measures for Defect Prediction Studies
Technical Papers
Rebecca Moussa University College London, Federica Sarro University College London
DOI
07:20
20m
Talk
Combining Solution Reuse and Bound Tightening for Efficient Analysis of Evolving Systems (ACM SIGSOFT Distinguished Paper)
Technical Papers
Clay Stevens University of Nebraska-Lincoln, Hamid Bagheri University of Nebraska-Lincoln
DOI
07:40
20m
Talk
Evolution-Aware Detection of Order-Dependent Flaky Tests
Technical Papers
Chengpeng Li University of Texas at Austin, August Shi University of Texas at Austin
DOI
08:00
20m
Talk
FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases
Technical Papers
Gabin An KAIST, Shin Yoo KAIST
DOI Pre-print