SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification
In recent years, the number of anonymous script-based fileless malware attacks, software copyright disputes, and code plagiarism issues has increased rapidly. In the literature, automated Code Authorship Analysis (CAA) techniques have been proposed to reduce the manual effort of identifying those attacks and issues. Most CAA techniques aim to solve the task of Authorship Attribution (AA), i.e., identifying the actual author of a source code fragment from a given set of candidate authors. However, in many real-world scenarios, investigators do not have a predefined set of authors that contains the actual author at the time of investigation, which violates AA’s assumption. Additionally, existing AA techniques ignore the influence of code functionality when identifying authorship, which biases them toward matching code fragments simply because they implement similar functionality.
Unlike AA, the task of (extreme) Authorship Verification (AV) is to decide whether two texts were written by the same person. AV techniques do not need a predefined author set and thus can be applied in more code authorship-related applications than AA. To our knowledge, no previous work has attempted to solve the AV problem for source code. To fill this gap, we propose a novel adversarial neural network, namely SCS-Gan, that can learn a stylometric representation of code for automated AV. Using a multi-head attention mechanism, SCS-Gan focuses on the code parts that are most informative regarding personal style and generates functionality-agnostic stylometric representations through adversarial training. We benchmark SCS-Gan and two state-of-the-art code representation models on four out-of-sample datasets collected from a real-world programming competition. Our experimental results show that SCS-Gan outperforms the baselines on all four out-of-sample datasets.
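To make the difference between AV and AA concrete, the following is a minimal sketch of verification as a pairwise decision. It assumes that each code fragment has already been mapped to a fixed-length style vector; the vectors, the cosine similarity measure, the `same_author` helper, and the 0.8 threshold are all illustrative assumptions and are not taken from SCS-Gan itself.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as carrying no style signal
    return dot / (norm_a * norm_b)


def same_author(style_a: Sequence[float],
                style_b: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Hypothetical AV decision: accept the pair if the two style vectors
    are similar enough. No candidate author set is needed, unlike AA.
    The threshold is an assumed value that would be tuned on held-out pairs."""
    return cosine_similarity(style_a, style_b) >= threshold
```

Because the decision depends only on the two inputs, the same function applies to authors never seen during training, which is the property that distinguishes AV from AA in the abstract above.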
Wed 17 May (displayed time zone: Hobart)
11:00 - 12:30 | Mining software repositories | Technical Track / Journal-First Papers / DEMO - Demonstrations | Meeting Room 102 | Chair(s): Brittany Johnson, George Mason University
11:00 | 15m | Talk | The untold story of code refactoring customizations in practice | Technical Track | Daniel Oliveira, PUC-Rio; Wesley Assunção, Johannes Kepler University Linz, Austria & Pontifical Catholic University of Rio de Janeiro, Brazil; Alessandro Garcia, PUC-Rio; Ana Carla Bibiano, PUC-Rio; Márcio Ribeiro, Federal University of Alagoas, Brazil; Rohit Gheyi, Federal University of Campina Grande; Baldoino Fonseca, Federal University of Alagoas (UFAL)
11:15 | 15m | Talk | Data Quality for Software Vulnerability Datasets | Technical Track | Roland Croft, The University of Adelaide; Muhammad Ali Babar, University of Adelaide; M. Mehdi Kholoosi, University of Adelaide
11:30 | 15m | Talk | Do code refactorings influence the merge effort? | Technical Track | André Oliveira, Federal Fluminense University; Vania Neves, Universidade Federal Fluminense (UFF); Alexandre Plastino, Federal Fluminense University; Ana Carla Bibiano, PUC-Rio; Alessandro Garcia, PUC-Rio; Leonardo Murta, Universidade Federal Fluminense (UFF)
11:45 | 7m | Talk | ActionsRemaker: Reproducing GitHub Actions | DEMO - Demonstrations | Hao-Nan Zhu, University of California, Davis; Kevin Guan, University of California, Davis; Robert M. Furth, University of California, Davis; Cindy Rubio-González, University of California at Davis
11:52 | 7m | Talk | Problems with SZZ and Features: An empirical assessment of the state of practice of defect prediction data collection | Journal-First Papers | Steffen Herbold, University of Passau; Alexander Trautsch, University of Passau; Alexander Trautsch, Germany; Benjamin Ledel, affiliation not listed
12:00 | 7m | Talk | An empirical study of issue-link algorithms: which issue-link algorithms should we use? | Journal-First Papers | Masanari Kondo, Kyushu University; Yutaro Kashiwa, Nara Institute of Science and Technology; Yasutaka Kamei, Kyushu University; Osamu Mizuno, Kyoto Institute of Technology
12:07 | 7m | Talk | SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification | Journal-First Papers | Weihan Ou, Queen's University at Kingston; Steven H. H. Ding, Queen's University at Kingston; Yuan Tian, Queen's University, Kingston, Canada; Leo Song, Queen's University at Kingston
12:15 | 15m | Talk | A Comprehensive Study of Real-World Bugs in Machine Learning Model Optimization | Technical Track | Hao Guan, The University of Queensland; Ying Xiao, Southern University of Science and Technology; Jiaying Li, Microsoft; Yepang Liu, Southern University of Science and Technology; Guangdong Bai, University of Queensland