ICSE 2023
Sun 14 - Sat 20 May 2023 Melbourne, Australia
Wed 17 May 2023 11:15 - 11:30 at Meeting Room 102 - Mining software repositories Chair(s): Brittany Johnson

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the datasets exhibit severe data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues had significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
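The duplication figures above can be made concrete with a small sketch of one way to measure duplication in a dataset of code samples, and to check whether test samples also occur in the training split (one way benchmark scores can be inflated). This is an illustration only, not the authors' method: it assumes a duplicate is an exact match after whitespace normalization, which may differ from the definition used in the paper, and the function names are hypothetical.

```python
import hashlib
import re
from collections import Counter


def normalize(code: str) -> str:
    """Collapse whitespace runs so formatting-only differences are ignored."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash the normalized code to get a compact duplicate key."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat an earlier sample (0.0 if empty)."""
    if not samples:
        return 0.0
    counts = Counter(fingerprint(s) for s in samples)
    # Every occurrence beyond the first of each fingerprint is redundant.
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(samples)


def leaked_fraction(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose fingerprint also appears in train."""
    if not test:
        return 0.0
    seen = {fingerprint(s) for s in train}
    return sum(fingerprint(s) in seen for s in test) / len(test)
```

A stricter or looser notion of duplication (for example, token-level near-duplicates) would give different rates, so the normalization step is the main design choice to revisit for a real dataset.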

Wed 17 May

Displayed time zone: Hobart

11:00 - 12:30
Mining software repositories
Technical Track / Journal-First Papers / DEMO - Demonstrations at Meeting Room 102
Chair(s): Brittany Johnson George Mason University
11:00
15m
Talk
The untold story of code refactoring customizations in practice
Technical Track
Daniel Oliveira PUC-Rio, Wesley Assunção Johannes Kepler University Linz, Austria & Pontifical Catholic University of Rio de Janeiro, Brazil, Alessandro Garcia PUC-Rio, Ana Carla Bibiano PUC-Rio, Márcio Ribeiro Federal University of Alagoas, Brazil, Rohit Gheyi Federal University of Campina Grande, Baldoino Fonseca Federal University of Alagoas (UFAL)
Pre-print
11:15
15m
Talk
Data Quality for Software Vulnerability Datasets
Technical Track
Roland Croft The University of Adelaide, Muhammad Ali Babar University of Adelaide, M. Mehdi Kholoosi University of Adelaide
Pre-print
11:30
15m
Talk
Do code refactorings influence the merge effort?
Technical Track
André Oliveira Federal Fluminense University, Vania Neves Universidade Federal Fluminense (UFF), Alexandre Plastino Federal Fluminense University, Ana Carla Bibiano PUC-Rio, Alessandro Garcia PUC-Rio, Leonardo Murta Universidade Federal Fluminense (UFF)
11:45
7m
Talk
ActionsRemaker: Reproducing GitHub Actions
DEMO - Demonstrations
Hao-Nan Zhu University of California, Davis, Kevin Z. Guan University of California, Davis, Robert M. Furth University of California, Davis, Cindy Rubio-González University of California at Davis
11:52
7m
Talk
Problems with SZZ and Features: An empirical assessment of the state of practice of defect prediction data collection
Journal-First Papers
Steffen Herbold University of Passau, Alexander Trautsch University of Passau, Alexander Trautsch Germany, Benjamin Ledel
12:00
7m
Talk
An empirical study of issue-link algorithms: which issue-link algorithms should we use?
Journal-First Papers
Masanari Kondo Kyushu University, Yutaro Kashiwa Nara Institute of Science and Technology, Yasutaka Kamei Kyushu University, Osamu Mizuno Kyoto Institute of Technology
12:07
7m
Talk
SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification
Journal-First Papers
Weihan Ou Queen's University at Kingston, Steven H. H. Ding Queen's University at Kingston, Yuan Tian Queen's University, Kingston, Canada, Leo Song Queen's University at Kingston
12:15
15m
Talk
A Comprehensive Study of Real-World Bugs in Machine Learning Model Optimization
Technical Track
Hao Guan The University of Queensland, Ying Xiao Southern University of Science and Technology, Jiaying LI Microsoft, Yepang Liu Southern University of Science and Technology, Guangdong Bai University of Queensland