PENTACET data - 23 Million Code Comments and 500,000 SATD comments
Most research on self-admitted technical debt (SATD) uses non-probabilistic sampling for data selection, which weakens the generalizability of its empirical findings. A closer look reveals that several SATD studies are based on simple ("easy to find") code comments without contextual data (the preceding and succeeding source code context). In this work, we address this gap with the PENTACET (or 5C) dataset. PENTACET is a large dataset of Curated Contextual Code Comments per Contributor and the most extensive SATD dataset. It was acquired by mining 9,096 open source Java projects with a total of 435 million LOC, and it captures bi-directional contextual information at all source code granularities in more than 26 million source code files. The outcome is a dataset with 23 million code comments, the source code context for each comment, and more than 500,000 comments labeled as SATD.
Tue 16 May (displayed time zone: Hobart)
11:00 - 11:45 | Documentation + Q&A II | Technical Papers / Data and Tool Showcase Track at Meeting Room 109 | Chair(s): Maram Assi (Queen's University)
11:00 | 12m | Talk | Understanding the Role of Images on Stack Overflow | Technical Papers | Dong Wang (Kyushu University, Japan), Tao Xiao (Nara Institute of Science and Technology), Christoph Treude (University of Melbourne), Raula Gaikovina Kula (Nara Institute of Science and Technology), Hideaki Hata (Shinshu University), Yasutaka Kamei (Kyushu University)
11:12 | 12m | Talk | Do Subjectivity and Objectivity Always Agree? A Case Study with Stack Overflow Questions | Technical Papers | Saikat Mondal (University of Saskatchewan), Masud Rahman (Dalhousie University), Chanchal K. Roy (University of Saskatchewan)
11:24 | 6m | Talk | GiveMeLabeledIssues: An Open Source Issue Recommendation System | Data and Tool Showcase Track | Joseph Vargovich (Northern Arizona University), Fabio Marcos De Abreu Santos (Northern Arizona University, USA), Jacob Penney (Northern Arizona University), Marco Gerosa (Northern Arizona University), Igor Steinmacher (Northern Arizona University)
11:30 | 6m | Talk | DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories | Data and Tool Showcase Track
11:36 | 6m | Talk | PENTACET data - 23 Million Code Comments and 500,000 SATD comments | Data and Tool Showcase Track | Murali Sridharan (University of Oulu), Leevi Rantala (University of Oulu), Mika Mäntylä (University of Oulu)