Python has become the most popular programming language because it is beginner-friendly. However, a recent study has found that most security issues in Python have not been indexed by CVE and may only be fixed by “silent” security commits, which pose a threat to software security and hinder the propagation of security fixes to downstream software. It is critical to identify these hidden security commits; however, existing datasets and methods are insufficient for security commit detection in Python because of limited data variety, non-comprehensive code semantics, and uninterpretable learned features. In this paper, we construct the first security commit dataset for Python, namely PySecDB, which consists of three subsets: a base dataset, a pilot dataset, and an augmented dataset. The base dataset contains the security commits associated with CVE records provided by MITRE. To increase the variety of security commits, we build the pilot dataset from GitHub by filtering commit messages for keywords. Since not all commits provide commit messages, we further construct the augmented dataset by understanding the semantics of code changes. To build the augmented dataset, we propose a new graph representation named CommitCPG and a multi-attributed graph learning model named SCOPY to identify security commit candidates through both sequential and structural code semantics. The evaluation shows that our proposed algorithms can improve data collection efficiency by up to 40 percentage points. After manual verification by three security experts, PySecDB consists of 1,258 security commits and 2,791 non-security commits. Furthermore, we conduct an extensive case study on PySecDB and discover four common security fix patterns that cover over 85% of security commits in Python, providing insight into secure software maintenance, vulnerability detection, and automated program repair.
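The keyword-filtering step behind the pilot dataset can be illustrated with a minimal sketch. The keyword list and the names `SECURITY_KEYWORDS`, `is_security_candidate`, and `filter_candidates` are hypothetical and chosen only for illustration; the abstract does not say which keywords or tooling the authors used.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical keyword stems for illustration only; the abstract does not
# state which keywords were used to build the pilot dataset.
SECURITY_KEYWORDS = ["security", "vulnerab", "cve", "xss", "injection", "sanitiz"]

# One case-insensitive pattern that matches any of the keyword stems.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in SECURITY_KEYWORDS), re.IGNORECASE
)


def is_security_candidate(message: Optional[str]) -> bool:
    """Return True if the commit message contains any security keyword."""
    # Commits with an empty or missing message can never match; the abstract
    # notes that such commits are handled by analysing code changes instead.
    return bool(message) and _PATTERN.search(message) is not None


def filter_candidates(
    commits: Iterable[Tuple[str, Optional[str]]]
) -> List[Tuple[str, Optional[str]]]:
    """Keep the (commit_id, message) pairs whose message looks security-related."""
    return [(sha, msg) for sha, msg in commits if is_security_candidate(msg)]
```

Calling `filter_candidates` on a list of `(commit_id, message)` pairs returns only the pairs whose message matches a keyword. A filter like this only produces candidates: keyword matches can be false positives, which is consistent with the manual verification by security experts described above.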
Thu 5 Oct (displayed time zone: Bogota, Lima, Quito, Rio Branco)
13:30 - 15:00 | Security and Program Repair | Research Track / Industry Track | at Session 1 Room - RGD 004
Chair(s): Quentin Stiévenart (Université du Québec à Montréal (UQAM)), Ashkan Sami (Edinburgh Napier University)
13:30 | 16m | Talk | Enhancing Code Language Models for Program Repair by Curricular Fine-tuning Framework | Research Track | Sichong Hao (Faculty of Computing, Harbin Institute of Technology), Xianjun Shi (Faculty of Computing, Harbin Institute of Technology), Hongwei Liu (Faculty of Computing, Harbin Institute of Technology), Yanjun Shu (Faculty of Computing, Harbin Institute of Technology)
13:46 | 16m | Talk | ScaleFix: An Automated Repair of UI Scaling Accessibility Issues in Android Applications | Research Track | Ali S. Alotaibi (University of Southern California), Paul T. Chiou (University of Southern California), Fazle Mohammed Tawsif (University of Southern California), William G.J. Halfond (University of Southern California)
14:02 | 16m | Talk | Finding an Optimal Set of Static Analyzers To Detect Software Vulnerabilities | Industry Track | Jiaqi He (University of Alberta), Revan MacQueen (University of Alberta), Natalie Bombardieri (University of Alberta), Karim Ali (University of Alberta), James Wright (University of Alberta), Cristina Cifuentes (Oracle Labs)
14:18 | 16m | Talk | DockerCleaner: Automatic Repair of Security Smells in Dockerfiles | Research Track | Quang-Cuong Bui (Hamburg University of Technology), Malte Laukötter (Hamburg University of Technology), Riccardo Scandariato (Hamburg University of Technology)
14:34 | 16m | Talk | Exploring Security Commits in Python | Research Track | Shiyu Sun (George Mason University), Shu Wang (George Mason University), Xinda Wang (George Mason University), Yunlong Xing (George Mason University), Elisa Zhang (Dougherty Valley High School), Kun Sun (George Mason University)
14:50 | 10m | Live Q&A | 1:1 Q&A | Research Track