Ghost Echoes Revealed: Benchmarking Maintainability Metrics and Machine Learning Predictions Against Human Assessments (ICSME 2024 - Industry Track)

Who

Markus Borg, Marwa Ezzouhri, Adam Tornhill

Track

ICSME 2024 Industry Track

Time Zone

The program is currently displayed in (GMT-07:00) Arizona.

Use conference time zone: (GMT-07:00) ArizonaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Wed 9 Oct 2024 16:10 - 16:25 at Fremont - Session 6: Maintenance of AI-based Systems Chair(s): Sujoy Roychowdhury

Abstract

As generative AI is expected to increase global code volumes, the importance of maintainability from a human perspective will become even greater. Various methods have been developed to identify the most important maintainability issues, including aggregated metrics and advanced Machine Learning (ML) models. This study benchmarks several maintainability prediction approaches, including State-of-the-Art (SotA) ML, SonarQube’s Maintainability Rating, CodeScene’s Code Health, and Microsoft’s Maintainability Index. Our results indicate that CodeScene matches the accuracy of SotA ML and outperforms the average human expert. Importantly, unlike SotA ML, CodeScene also provides end users with actionable code smell details to remedy identified issues. Finally, caution is advised with SonarQube due to its tendency to generate many false positives. Unfortunately, our findings call into question the validity of previous studies that solely relied on SonarQube output for establishing ground truth labels. To improve reliability in future maintainability and technical debt studies, we recommend employing more accurate metrics. Moreover, reevaluating previous findings with Code Health would mitigate this revealed validity threat.

Link to Preprint

https://arxiv.org/abs/2408.10754

Markus Borg

CodeScene

Sweden

Marwa Ezzouhri

University of Clermont Auvergne

France

Adam Tornhill

Codescene AB