SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification (ICSE 2023 - Journal-First Papers)

Who

Weihan Ou, Ding Steven, H., H., Yuan Tian, Leo Song

Track

ICSE 2023 Journal-First Papers

Time Zone

The program is currently displayed in (GMT+10:00) Hobart.

Use conference time zone: (GMT+10:00) HobartSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Wed 17 May 2023 12:07 - 12:15 at Meeting Room 102 - Mining software repositories Chair(s): Brittany Johnson

Abstract

In recent years, the number of anonymous script-based fileless malware attacks, software copyright disputes, and code plagiarism issues has increased rapidly. In the literature, automated Code Authorship Analysis (CAA) techniques have been proposed to reduce the manual effort in identifying those attacks and issues. Most CAA techniques aim to solve the task of Authorship Attribution (AA), i.e., identifying the actual author of a source code fragment from a given set of candidate authors. However, in many real-world scenarios, investigators do not have a predefined set of authors containing the actual author at the time of investigation, i.e., contradicting AA’s assumption. Additionally, existing AA techniques ignore the influence of code functionality when identifying the authorship, which leads to biased matching simply based on code functionality.

Different from AA, the task of (extreme) Authorship Verification (AV) is to decide if two texts were written by the same person or not. AV techniques do not need a predefined author set and thus could be applied in more code authorship-related applications than AA. To our knowledge, there is no previous work attempting to solve the AV problem for the source code. To fill the gap, we propose a novel adversarial neural network, namely SCS-Gan, that can learn a stylometric representation of code for automated AV. With the multi-head attention mechanism, SCS-Gan focuses on the code parts that are most informative regarding personal styles and generates functionality-agnostic stylometric representations through adversarial training. We benchmark SCS-Gan and two state-of-the-art code representation models on four out-of-sample datasets collected from a real-world programming competition. Our experiment results show that SCS-Gan outperforms the baselines on all four out-of-sample datasets.

Weihan Ou

Queen's University at Kingston

Ding Steven, H., H.

Queen’s University at Kingston

Yuan Tian

Queens University, Kingston, Canada

Canada

Leo Song