From Reinvention to Reuse: An Empirical Example Study On Technical Debt Dataset (PROFES 2024 - Research Papers)

Who

Leevi Rantala, Mika Mäntylä, Murali Sridharan

Track

PROFES 2024 Research Papers

Time Zone

Times are shown in the conference time zone, (GMT+02:00) Athens.

When

Tue 3 Dec 2024 16:18 - 16:36 at UT Library - Room 3 (Seminar Room Kodavere) - PROFES Session 6: Technical Debt Chair(s): Eriks Klotins

Abstract

Self-Admitted Technical Debt (SATD) is a subset of Technical Debt (TD) in which the developer leaves a comment in the source code, marking the place where debt has been taken on. Previous research on SATD relies on either creating new datasets or reusing existing ones. One seminal SATD dataset, published by Maldonado et al., contains over 4,000 SATD comments classified into five TD categories. Its drawback is that it lacks any other information, such as static analysis results, which seriously limits its possible use cases. We remedy this by reforming the dataset: we combine the original comments with contextual information and static analysis of the source code, and recreate the dataset as an SQLite database. Our reformed dataset contains over 13,000 files, nearly 14,000 classes, almost 100,000 methods, and over 650,000 code violation instances. It allows varied and detailed analyses in the future, which we demonstrate by examining the relationship between SATD comments and code violations. The results show that, at the method level, the most important predictors are the total number of code violations and the number of violations labelled as Priority 3 or belonging to the Documentation Rule Set. At the file level, LOC is an important predictor, alongside the number of violations that belong to the Documentation Rule Set or have a Priority 2 classification. Overall, our example study demonstrates the potential of reforming existing datasets.
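
As a rough illustration of the kind of query an SQLite-based SATD dataset makes possible, the sketch below derives per-method violation counts of the sort the abstract names as predictors. The schema is hypothetical: the table names `method` and `violation`, their columns, and the toy rows are assumptions made up for this example and are not taken from the published dataset.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE method (
            id INTEGER PRIMARY KEY,
            name TEXT,
            has_satd INTEGER          -- 1 if the method carries a SATD comment
        );
        CREATE TABLE violation (
            id INTEGER PRIMARY KEY,
            method_id INTEGER REFERENCES method(id),
            priority INTEGER,
            rule_set TEXT
        );
        """
    )
    # Toy rows, invented purely for illustration.
    conn.executemany(
        "INSERT INTO method VALUES (?, ?, ?)",
        [(1, "parse", 1), (2, "render", 0), (3, "flush", 0)],
    )
    conn.executemany(
        "INSERT INTO violation (method_id, priority, rule_set) VALUES (?, ?, ?)",
        [
            (1, 3, "Documentation"),
            (1, 3, "Design"),
            (1, 2, "Documentation"),
            (2, 3, "Documentation"),
        ],
    )
    conn.commit()
    return conn


def violation_features(conn):
    """Return (name, has_satd, total, priority3, documentation) per method."""
    query = """
        SELECT m.name,
               m.has_satd,
               COUNT(v.id) AS total,
               COALESCE(SUM(v.priority = 3), 0) AS priority3,
               COALESCE(SUM(v.rule_set = 'Documentation'), 0) AS documentation
        FROM method AS m
        LEFT JOIN violation AS v ON v.method_id = m.id  -- keep methods with no violations
        GROUP BY m.id
        ORDER BY m.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in violation_features(build_toy_db()):
        print(row)
```

Rows like these could then be fed to a classifier or regression model with `has_satd` as the target. The actual modelling approach used in the paper is not described in the abstract, so none is assumed here.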

Leevi Rantala

University of Oulu

Mika Mäntylä

University of Helsinki and University of Oulu

Finland

Murali Sridharan