ReposVul: A Repository-Level High-Quality Vulnerability Dataset (ICSE 2024 - Industry Challenge Track)

Fri 12 - Sun 21 April 2024 Lisbon, Portugal

Who

Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen, Qing Liao

Track

ICSE 2024 Industry Challenge Track

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 17 Apr 2024 14:45 - 15:00 at Amália Rodrigues - Evolution 1 Chair(s): Jonathan Sillito

Abstract

Open-Source Software (OSS) vulnerabilities bring great challenges to software security and pose potential risks to our society. Enormous efforts have been devoted to automated vulnerability detection, among which deep learning (DL)-based approaches have proven to be the most effective. However, the performance of DL-based approaches generally relies on the quantity and quality of labeled data, and the current labeled data present the following limitations: (1) Tangled Patches: Developers may submit code changes unrelated to vulnerability fixes within patches, leading to tangled patches. (2) Lacking Inter-procedural Vulnerabilities: The existing vulnerability datasets typically contain function-level and file-level vulnerabilities and ignore the relations between functions, rendering the approaches unable to detect inter-procedural vulnerabilities. (3) Outdated Patches: The existing datasets usually contain outdated patches, which may bias the model during training.

To address the above limitations, in this paper, we propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset, named ReposVul. The proposed framework mainly contains three modules: (1) A vulnerability untangling module, which aims to distinguish vulnerability-fixing code changes from tangled patches by jointly employing Large Language Models (LLMs) and static analysis tools. (2) A multi-granularity dependency extraction module, which aims to capture the inter-procedural call relationships of vulnerabilities by constructing multi-granularity information for each vulnerability patch at the repository, file, function, and line levels. (3) A trace-based filtering module, which aims to filter out outdated patches by leveraging a file path trace-based filter and a commit time trace-based filter to construct an up-to-date dataset.

The constructed repository-level ReposVul encompasses 6,134 CVE entries representing 236 CWE types across 1,491 projects and four programming languages. Thorough data analysis and manual checking demonstrate that ReposVul is high in quality and alleviates the problems of tangled and outdated patches in previous vulnerability datasets.
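As a rough illustration of the trace-based filtering idea mentioned in the abstract, the sketch below keeps only the most recently committed patch for each file path. This is an assumed simplification, not the authors' implementation: the abstract does not specify how the file path and commit time filters work, and the `PatchRecord` fields, the `keep_latest_per_path` function, and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PatchRecord:
    # Hypothetical fields for illustration; not the ReposVul schema.
    cve_id: str
    file_path: str
    commit_time: datetime


def keep_latest_per_path(patches):
    """Return one patch per file path: the one with the latest commit time.

    Assumption: a patch counts as outdated when a later patch touches
    the same file path. The real filters may use different criteria.
    """
    latest = {}
    for patch in patches:
        seen = latest.get(patch.file_path)
        if seen is None or patch.commit_time > seen.commit_time:
            latest[patch.file_path] = patch
    # Sort by commit time so the result order is deterministic.
    return sorted(latest.values(), key=lambda p: p.commit_time)


# Made-up sample data: two patches to the same file, one to another file.
patches = [
    PatchRecord("CVE-EXAMPLE-1", "src/parser.c", datetime(2020, 1, 5)),
    PatchRecord("CVE-EXAMPLE-1", "src/parser.c", datetime(2021, 3, 10)),
    PatchRecord("CVE-EXAMPLE-2", "lib/net.c", datetime(2019, 7, 1)),
]
kept = keep_latest_per_path(patches)
```

Here the older `src/parser.c` patch is dropped because a later patch touches the same path, while the single `lib/net.c` patch is retained.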

Link to Preprint

https://scholar.google.com/citations?view_op=view_citation&hl=zh-CN&user=QG5jMcYAAAAJ&citation_for_view=QG5jMcYAAAAJ:9yKSN-GCB0IC

File attachments

(ICSE论文.pdf)	2.65MiB

Xinchen Wang

Harbin Institute of Technology

Ruida Hu

Harbin Institute of Technology, Shenzhen

China

Cuiyun Gao

Harbin Institute of Technology

China

Xin-Cheng Wen

Harbin Institute of Technology

Yujia Chen

Harbin Institute of Technology, Shenzhen

Qing Liao

Harbin Institute of Technology

China

Session Program

Wed 17 Apr
Displayed time zone: Lisbon

14:00 - 15:30	Evolution 1 (Research Track / Journal-first Papers / Demonstrations / Industry Challenge Track) at Amália Rodrigues. Chair(s): Jonathan Sillito, Brigham Young University

14:00 15m Talk		Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning	Research Track	Mingyang Geng National University of Defense Technology, Shangwen Wang National University of Defense Technology, Dezun Dong NUDT, Haotian Wang National University of Defense Technology, Ge Li Peking University, Zhi Jin Peking University, Xiaoguang Mao National University of Defense Technology, Liao Xiangke National University of Defense Technology
14:15 15m Talk		Block-based Programming for Two-Armed Robots: A Comparative Study	Research Track	Felipe Fronchetti Virginia Commonwealth University, Nico Ritschel University of British Columbia, Logan Schorr Virginia Commonwealth University, Chandler Barfield Virginia Commonwealth University, Gabriella Chang Virginia Commonwealth University, Rodrigo Spinola Virginia Commonwealth University, Reid Holmes University of British Columbia, David C. Shepherd Louisiana State University
14:30 15m Talk		Exploiting Library Vulnerability via Migration Based Automating Test Generation	Research Track	Zirui Chen, Xing Hu Zhejiang University, Xin Xia Huawei Technologies, Yi Gao Zhejiang University, Tongtong Xu Huawei, David Lo Singapore Management University, Xiaohu Yang Zhejiang University
14:45 15m Talk		ReposVul: A Repository-Level High-Quality Vulnerability Dataset	Industry Challenge Track	Xinchen Wang Harbin Institute of Technology, Ruida Hu Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Xin-Cheng Wen Harbin Institute of Technology, Yujia Chen Harbin Institute of Technology, Shenzhen, Qing Liao Harbin Institute of Technology
15:00 7m Talk		JOG: Java JIT Peephole Optimizations and Tests from Patterns	Demonstrations	Zhiqiang Zang The University of Texas at Austin, Aditya Thimmaiah The University of Texas at Austin, Milos Gligoric The University of Texas at Austin
15:07 7m Talk		Predicting the Change Impact of Resolving Defects by Leveraging the Topics of Issue Reports in Open Source Software Systems	Journal-first Papers	Maram Assi Queen's University, Safwat Hassan University of Toronto, Canada, Stefanos Georgiou Queen's University, Ying Zou Queen's University, Kingston, Ontario
15:14 7m Talk		Assessing the Exposure of Software Changes	Journal-first Papers	Mehran Meidani University of Waterloo, Maxime Lamothe Polytechnique Montreal, Shane McIntosh University of Waterloo
15:21 7m Talk		Responding to change over time: A longitudinal case study on changes in coordination mechanisms in large-scale agile	Journal-first Papers	Marthe Berntzen University of Oslo, Viktoria Stray University of Oslo, Nils Brede Moe, Rashina Hoda Monash University