Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!
Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically, and it has demonstrated promising results. However, one prominent and practical issue for vulnerability detection is data imbalance. A prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously on real-world imbalanced data, with a 73% drop in F1-score on average across the studied approaches. Such a significant performance drop can preclude the practical use of any DLVD approach. Data sampling is effective in alleviating data imbalance for machine learning models, and its effectiveness has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematic and extensive study to assess the impact of data sampling on the data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (i.e., to make a decision based on real vulnerable statements). We found that, in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling in the latent space; random oversampling on raw data typically performs the best among all studied techniques (including the more advanced SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD at all. If recall is the priority, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches to learn real vulnerable patterns. However, in a significant portion of cases (at least 33% in our datasets), DLVD approaches cannot ground their predictions in real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.
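The two sampling families compared above can be illustrated with a minimal sketch. The snippet below is not the study's implementation; it only shows what random oversampling and random undersampling on raw data look like when applied to function-level (code, label) pairs before training, and all function names and the toy data are illustrative. SMOTE and OSS, by contrast, synthesize or remove samples in a feature (latent) space rather than duplicating or dropping raw code samples.

import random

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples (with replacement) until both classes are balanced."""
    rng = random.Random(seed)
    vulnerable = [s for s in samples if s[1] == 1]
    clean = [s for s in samples if s[1] == 0]
    minority, majority = sorted((vulnerable, clean), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

def random_undersample(samples, seed=0):
    """Drop majority-class samples at random until both classes are balanced."""
    rng = random.Random(seed)
    vulnerable = [s for s in samples if s[1] == 1]
    clean = [s for s in samples if s[1] == 0]
    minority, majority = sorted((vulnerable, clean), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Illustrative toy dataset: 1 vulnerable function vs. 9 clean ones.
data = [("int f(char *s) { strcpy(buf, s); }", 1)] + [("int g(int x) { return x + 1; }", 0)] * 9
print(len(random_oversample(data)))   # 18 samples, 9 per class
print(len(random_undersample(data)))  # 2 samples, 1 per class

Because the duplication or dropping happens on the raw samples, either resampled list can be fed directly into any existing DLVD training pipeline without changing the model.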
Fri 19 May (displayed time zone: Hobart)
13:45 - 15:15 | Vulnerability detection | Technical Track / Journal-First Papers | Meeting Room 106 | Chair(s): Cuiyun Gao (Harbin Institute of Technology)
13:45 (15m) Talk | An Empirical Study of Deep Learning Models for Vulnerability Detection | Technical Track | Benjamin Steenhoek (Iowa State University), Md Mahbubur Rahman (Iowa State University), Richard Jiles (Iowa State University), Wei Le (Iowa State University) | Pre-print
14:00 (15m) Talk | DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection | Technical Track | Wenbo Wang (New Jersey Institute of Technology), Tien N. Nguyen (University of Texas at Dallas), Shaohua Wang (New Jersey Institute of Technology), Yi Li (New Jersey Institute of Technology), Jiyuan Zhang (University of Illinois Urbana-Champaign), Aashish Yadavally (The University of Texas at Dallas) | Pre-print
14:15 (15m) Talk | Enhancing Deep Learning-based Vulnerability Detection by Building Behavior Graph Model | Technical Track | Bin Yuan (Huazhong University of Science and Technology), Yifan Lu (Huazhong University of Science and Technology), Yilin Fang (Huazhong University of Science and Technology), Yueming Wu (Nanyang Technological University), Deqing Zou (Huazhong University of Science and Technology), Zhen Li (Huazhong University of Science and Technology), Zhi Li (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)
14:30 (15m) Talk | Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning | Technical Track | Xin-Cheng Wen (Harbin Institute of Technology), Yupan (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Hongyu Zhang (The University of Newcastle), Jie M. Zhang (King's College London), Qing Liao (Harbin Institute of Technology)
14:45 (15m) Talk | Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays! | Technical Track | Xu Yang (University of Manitoba), Shaowei Wang (University of Manitoba), Yi Li (New Jersey Institute of Technology), Shaohua Wang (New Jersey Institute of Technology) | Pre-print
15:00 (7m) Talk | Learning from What We Know: How to Perform Vulnerability Prediction using Noisy Historical Data | Journal-First Papers | Aayush Garg (University of Luxembourg), Renzo Degiovanni (SnT, University of Luxembourg), Matthieu Jimenez (SnT, University of Luxembourg), Maxime Cordy (University of Luxembourg), Mike Papadakis (University of Luxembourg), Yves Le Traon (University of Luxembourg) | Link to publication, DOI, Authorizer link, Pre-print, Media Attached
15:07 (7m) Talk | Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application | Journal-First Papers | Sarah Elder (North Carolina State University), Nusrat Zahan (North Carolina State University), Rui Shu (North Carolina State University), Valeri Kozarev (North Carolina State University), Tim Menzies (North Carolina State University), Laurie Williams (North Carolina State University)