Preliminary comparison of techniques for dealing with imbalance in software defect prediction (EASE 2024 - Most Influential Paper Award)

Tue 18 - Fri 21 June 2024 Salerno, Italy

Who

Daniel Rodriguez, Israel Herraiz, Rachel Harrison, José Javier Dolado, José C. Riquelme

Track

EASE 2024 Most Influential Paper Award

Abstract

Imbalanced data is a common problem in data mining classification tasks, where samples of one class vastly outnumber those of the other classes. In this situation, many data mining algorithms generate poor models because they try to optimize overall accuracy and so perform badly on classes with very few samples. Software engineering data in general, and defect prediction datasets in particular, are no exception. In this paper, we compare different approaches to the problem of defect prediction, namely sampling, cost-sensitive, ensemble and hybrid approaches, using datasets preprocessed in different ways. We have used the well-known NASA datasets curated by Shepperd et al. The results differ depending on the characteristics of the dataset and the evaluation metrics, especially when duplicates and inconsistencies are removed as a preprocessing step. Further results and replication package: http://www.cc.uah.es/drg/ease14
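
The abstract's point about accuracy can be illustrated with a minimal sketch. It is not taken from the paper or its replication package; the toy labels (95 non-defective modules, 5 defective) and the helper names `majority_baseline`, `accuracy` and `recall` are assumptions made purely for illustration. A model that always predicts the majority class scores high accuracy while finding no defective modules at all.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label, the prediction an accuracy-driven model may default to."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical toy data: 95 non-defective modules (0) and 5 defective ones (1).
y_true = [0] * 95 + [1] * 5

# Always predict the majority class, ignoring the inputs entirely.
y_pred = [majority_baseline(y_true)] * len(y_true)

print(accuracy(y_true, y_pred))            # 0.95
print(recall(y_true, y_pred, positive=1))  # 0.0
```

This gap between accuracy and minority-class recall is one reason the choice of evaluation metric matters when comparing sampling, cost-sensitive, ensemble and hybrid approaches.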

Link to Publication

https://dl.acm.org/doi/10.1145/2601248.2601294

Preliminary comparison of techniques for dealing with imbalance in software defect prediction

Daniel Rodriguez

The University of Alcala

Spain

Israel Herraiz

Rachel Harrison

Oxford Brookes University

United Kingdom

José Javier Dolado

José C. Riquelme
