On the Reliability of the Area Under the ROC Curve in Empirical Software Engineering (EASE 2023 - Research (Full Papers))

Who

Luigi Lavazza, Sandro Morasca, Gabriele Rotoloni

Track

EASE 2023 Research (Full Papers)

Time Zone

The program is currently displayed in (GMT+03:00) Athens.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 14 Jun 2023 15:30 - 15:50 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann

Abstract

Binary classifiers are commonly used in software engineering research to estimate several software qualities, e.g., defectiveness or vulnerability. Thus, it is important to adequately evaluate how well binary classifiers perform before they are used in practice. The Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curves has often been used to this end. However, AUC has been widely criticized, so it is necessary to evaluate under what conditions and to what extent AUC can be a reliable performance metric.

We analyze AUC in relation to φ (also known as the Matthews Correlation Coefficient), often considered a more reliable performance metric. For several values of φ, we build the lines in the ROC space along which φ is constant and compute the corresponding values of AUC.
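
As a reading aid, φ can be rewritten in ROC coordinates once the confusion-matrix cells are expressed as fractions of the dataset. The block below is a sketch of that rewriting, not a formula quoted from the paper. The symbols TPR (true positive rate), FPR (false positive rate) and q (fraction of instances predicted positive) are notation assumed here for illustration. ρ is the prevalence, i.e., the proportion of positive instances.

```latex
% Confusion-matrix cells as fractions of the dataset (assumed notation)
\begin{align*}
TP &= \rho\,\mathrm{TPR}, & FN &= \rho\,(1-\mathrm{TPR}),\\
FP &= (1-\rho)\,\mathrm{FPR}, & TN &= (1-\rho)\,(1-\mathrm{FPR}).
\end{align*}
% Substituting into the usual definition of phi (MCC)
\[
\varphi
  = \frac{TP \cdot TN - FP \cdot FN}
         {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
  = \frac{\sqrt{\rho\,(1-\rho)}\;(\mathrm{TPR}-\mathrm{FPR})}
         {\sqrt{q\,(1-q)}},
\qquad
q = \rho\,\mathrm{TPR} + (1-\rho)\,\mathrm{FPR}.
\]
```

For a fixed ρ, setting φ to a constant in this expression defines one line in the ROC space. For example, with ρ = 0.5 the point (FPR, TPR) = (0.3, 0.7) gives q = 0.5 and φ = 0.4.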

By their very definitions, AUC and φ depend on the prevalence ρ of a dataset, which is the proportion of its positive instances (e.g., the defective software modules). Hence, so does the relationship between AUC and φ. It turns out that AUC and φ are very well correlated, and therefore provide concordant indications, for balanced datasets (those with ρ around 0.5). By contrast, AUC tends to become quite large, and hence to provide over-optimistic indications, for very imbalanced datasets (those with ρ close to 0 or 1).
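
The effect of prevalence can be explored numerically with a minimal Python sketch. It is not the authors' implementation. The function names (`phi_from_roc`, `iso_phi_tpr`, `auc_of_iso_phi`) are hypothetical. Bisection and the trapezoidal rule are choices made here, and so is the convention of clamping the curve to TPR = 1 wherever the target φ can no longer be reached. The sketch expresses φ in ROC coordinates, traces the constant-φ line for a given ρ, and integrates the area under it.

```python
import math


def phi_from_roc(tpr, fpr, rho):
    """phi (MCC) at ROC point (fpr, tpr) for a dataset with prevalence rho."""
    q = rho * tpr + (1.0 - rho) * fpr      # fraction predicted positive
    spread = q * (1.0 - q)
    if spread <= 0.0:
        return 0.0                         # assumed convention: degenerate case
    return math.sqrt(rho * (1.0 - rho)) * (tpr - fpr) / math.sqrt(spread)


def iso_phi_tpr(fpr, phi, rho, iters=60):
    """TPR at which the constant-phi line crosses the given FPR.

    Assumption of this sketch: if even TPR = 1 cannot reach phi,
    the line is clamped to TPR = 1.
    """
    if phi_from_roc(1.0, fpr, rho) < phi:
        return 1.0
    lo, hi = fpr, 1.0                      # phi grows with TPR on [fpr, 1]
    for _ in range(iters):                 # plain bisection
        mid = (lo + hi) / 2.0
        if phi_from_roc(mid, fpr, rho) < phi:
            lo = mid
        else:
            hi = mid
    return hi


def auc_of_iso_phi(phi, rho, steps=2000):
    """Trapezoidal area under the constant-phi line over FPR in [0, 1]."""
    xs = [i / steps for i in range(steps + 1)]
    ys = [iso_phi_tpr(x, phi, rho) for x in xs]
    return sum((ys[i] + ys[i + 1]) / 2.0 * (xs[i + 1] - xs[i])
               for i in range(steps))


if __name__ == "__main__":
    for rho in (0.5, 0.1, 0.01):
        print(f"rho={rho}: AUC of the phi=0.4 line = "
              f"{auc_of_iso_phi(0.4, rho):.3f}")
```

Under these assumptions, the area under the φ = 0.4 line is noticeably larger at ρ = 0.01 than at ρ = 0.5. This is the same direction as the over-optimism described above for very imbalanced datasets.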

We use examples from the software engineering literature to illustrate the analytical relationship linking AUC, φ and ρ. We show that, for some values of ρ, the evaluation of performance based exclusively on AUC can be deceiving. In conclusion, this paper provides some guidelines for an informed usage and interpretation of AUC.

Link to Preprint

https://easychair.org/publications/preprint/srQN

File attachments

Presentation-4023 (4023-PresentationEASE2023.pptx), 728 KiB

Luigi Lavazza

Università degli Studi dell'Insubria

Italy

Sandro Morasca