PENTACET data - 23 Million Code Comments and 500,000 SATD comments (MSR 2023 - Data and Tool Showcase Track)

Melbourne, Australia

co-located with ICSE 2023

Who

Murali Sridharan, Leevi Rantala, Mika Mäntylä

Track

MSR 2023 Data and Tool Showcase Track

When

Tue 16 May 2023 11:36 - 11:42 at Meeting Room 109 - Documentation + Q&A II Chair(s): Maram Assi

Abstract

Most self-admitted technical debt (SATD) research relies on non-probabilistic sampling for data selection, which weakens the generalizability of its empirical findings. A closer look reveals that several SATD studies are based on simple ('easy to find') code comments without contextual data (the preceding and succeeding source code context). In this work, we address this gap with the PENTACET (or 5C) dataset. PENTACET is a large Curated Contextual Code Comments per Contributor dataset and the most extensive SATD dataset. It was acquired by mining 9,096 open source software Java projects totaling 435 million LOC, and it captures bi-directional contextual information at all source code granularities in more than 26 million source code files. The outcome is a dataset with 23 million code comments, the source code context for each comment, and more than 500,000 comments labeled as SATD.

Murali Sridharan

University of Oulu

Leevi Rantala

University of Oulu

Mika Mäntylä

University of Oulu

Finland

Session Program

Tue 16 May
Displayed time zone: Hobart

11:00 - 11:45	Documentation + Q&A II (Technical Papers / Data and Tool Showcase Track) at Meeting Room 109. Chair(s): Maram Assi (Queen's University)

11:00 12m Talk		Understanding the Role of Images on Stack Overflow (Technical Papers). Dong Wang (Kyushu University, Japan), Tao Xiao (Nara Institute of Science and Technology), Christoph Treude (University of Melbourne), Raula Gaikovina Kula (Nara Institute of Science and Technology), Hideaki Hata (Shinshu University), Yasutaka Kamei (Kyushu University)
11:12 12m Talk		Do Subjectivity and Objectivity Always Agree? A Case Study with Stack Overflow Questions (Technical Papers). Saikat Mondal (University of Saskatchewan), Masud Rahman (Dalhousie University), Chanchal K. Roy (University of Saskatchewan)
11:24 6m Talk		GiveMeLabeledIssues: An Open Source Issue Recommendation System (Data and Tool Showcase Track). Joseph Vargovich (Northern Arizona University), Fabio Marcos De Abreu Santos (Northern Arizona University, USA), Jacob Penney (Northern Arizona University), Marco Gerosa (Northern Arizona University), Igor Steinmacher (Northern Arizona University)
11:30 6m Talk		DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories (Data and Tool Showcase Track). Akhila Sri Manasa Venigalla (IIT Tirupati), Sridhar Chimalakonda (IIT Tirupati)
11:36 6m Talk		PENTACET data - 23 Million Code Comments and 500,000 SATD comments (Data and Tool Showcase Track). Murali Sridharan (University of Oulu), Leevi Rantala (University of Oulu), Mika Mäntylä (University of Oulu)