Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue
Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks, such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced, since the percentage of observed vulnerabilities is usually very low. Goal: To help security practitioners address the class imbalance issue in software security data and build better prediction models from the resampled datasets. Method: We introduce an approach called Dazzle, an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with Bayesian Optimization. We use Dazzle to generate minority-class samples and resample the original imbalanced training dataset. Results: We evaluate Dazzle on three software security datasets: Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and yields promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., an average improvement of about 60% over SMOTE in recall across all datasets). Conclusion: Based on this study, we suggest optimized GANs as an alternative method for addressing class imbalance in security vulnerability data.
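The method described above is a pipeline that (1) searches the architecture hyperparameters of a cWGAN-GP with Bayesian Optimization and (2) uses the tuned generator to oversample the minority class before training a prediction model. The sketch below illustrates that loop under loud assumptions: it uses scikit-optimize's `gp_minimize` as the Bayesian optimizer, a toy imbalanced dataset, a plain logistic-regression classifier, and a simple Gaussian resampler (`generate_minority_samples`) as a hypothetical stand-in for the trained cWGAN-GP generator; none of these specific choices come from the paper itself.

```python
# Minimal, hypothetical sketch of a Dazzle-style pipeline: Bayesian optimization
# over oversampler hyperparameters, then resample and train a classifier.
# The cWGAN-GP generator is replaced by a simple Gaussian stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from skopt import gp_minimize
from skopt.space import Integer, Real

rng = np.random.default_rng(0)

# Toy imbalanced dataset (~5% positive class) standing in for, e.g., Moodle files.
X = rng.normal(size=(2000, 20))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 1.0  # give the minority class some signal
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

def generate_minority_samples(X_min, n_samples, noise_scale):
    """Hypothetical stand-in for a trained cWGAN-GP generator: sample around
    observed minority-class points with a tunable noise scale."""
    idx = rng.integers(0, len(X_min), size=n_samples)
    noise = rng.normal(scale=noise_scale, size=(n_samples, X_min.shape[1]))
    return X_min[idx] + noise

def objective(params):
    # Hyperparameters proposed by the Bayesian optimizer.
    n_synthetic, noise_scale = params
    X_min = X_train[y_train == 1]
    X_syn = generate_minority_samples(X_min, n_synthetic, noise_scale)
    X_res = np.vstack([X_train, X_syn])
    y_res = np.concatenate([y_train, np.ones(len(X_syn), dtype=int)])
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    # gp_minimize minimizes, so return negative recall on the validation split.
    return -recall_score(y_val, clf.predict(X_val))

space = [Integer(50, 1000, name="n_synthetic"), Real(0.05, 1.0, name="noise_scale")]
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best (n_synthetic, noise_scale):", result.x, "validation recall:", -result.fun)
```

Recall on a held-out validation split is the quantity being maximized (negated, since `gp_minimize` minimizes), mirroring the recall improvements the abstract reports; in the paper the inner step would instead train a cWGAN-GP with the proposed architecture hyperparameters.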
Wed 18 May (displayed time zone: Eastern Time, US & Canada)
13:00 - 13:50 | Session 4: Software Quality (Bugs & Smells) | Data and Tool Showcase Track / Technical Papers at MSR | Main room - odd hours | Chair(s): Maxime Lamothe (Polytechnique Montreal, Montreal, Canada), Mahmoud Alfadel (University of Waterloo)
13:00 (7m) Talk | Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue | Technical Papers | Rui Shu (North Carolina State University), Tianpei Xia (North Carolina State University), Laurie Williams (North Carolina State University), Tim Menzies (North Carolina State University)
13:07 (7m) Talk | To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set? | Technical Papers | Matteo Ciniselli (Università della Svizzera Italiana), Luca Pascarella (Università della Svizzera italiana (USI)), Gabriele Bavota (Software Institute, USI Università della Svizzera italiana) | Pre-print
13:14 (7m) Talk | How to Improve Deep Learning for Software Analytics (a case study with code smell detection) | Technical Papers | Pre-print
13:21 (7m) Talk | Using Active Learning to Find High-Fidelity Builds | Technical Papers | Harshitha Menon (Lawrence Livermore National Lab), Konstantinos Parasyris (Lawrence Livermore National Laboratory), Todd Gamblin (Lawrence Livermore National Laboratory), Tom Scogland (Lawrence Livermore National Laboratory) | Pre-print
13:28 (4m) Talk | ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction | Data and Tool Showcase Track | Hossein Keshavarz (David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada), Mei Nagappan (University of Waterloo) | Pre-print
13:32 (4m) Talk | ReCover: a Curated Dataset for Regression Testing Research | Data and Tool Showcase Track | Francesco Altiero, Anna Corazza, Sergio Di Martino, Adriano Peron, Luigi Libero Lucio Starace (Università degli Studi di Napoli Federico II)
13:36 (14m) Live Q&A | Discussions and Q&A | Technical Papers