Interpreting CodeBERT for Semantic Code Clone Detection (APSEC 2023 - Technical Track)

Who

Shamsa Abid, Xuemeng Cai, Lingxiao Jiang

Track

APSEC 2023 Technical Track

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 5 Dec 2023 16:00 - 16:30 at Grand Hall 4 - AI and Software Engineering (3) Chair(s): Jaechang Nam

Abstract

Accurate detection of semantic code clones has many applications in software engineering but is challenging because of lexical, syntactic, or structural dissimilarities in code. CodeBERT, a popular pre-trained code model based on deep neural networks, can detect code clones with high accuracy. However, its performance on unseen data is reported to be lower. A challenge is to interpret CodeBERT’s clone detection behavior and isolate the causes of mispredictions. In this paper, we evaluate CodeBERT and interpret its clone detection behavior on the SemanticCloneBench dataset, focusing on Java and Python clone pairs. We introduce the use of a black-box model interpretation technique, SHAP, to identify the core features of code that CodeBERT pays attention to for clone prediction. We first perform a manual similarity analysis over a sample of clone pairs to revise clone labels and to assign labels to statements indicating their contribution to core functionality. We then evaluate the correlation between the human’s and the model’s interpretations of the core features of code as a measure of CodeBERT’s trustworthiness. We observe only a weak correlation. Finally, we present examples of how to identify causes of mispredictions for CodeBERT. Our technique can help researchers assess and fine-tune their models’ performance.
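
The correlation step in the abstract can be illustrated with a small sketch. The abstract does not say which correlation measure the authors use, so Spearman rank correlation is an assumption here, and the function names and all values are hypothetical. The sketch does not call SHAP or CodeBERT; it only shows how per-statement human labels might be compared with per-statement attribution scores once both are available.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(ranks(x), ranks(y))


# Hypothetical data for one code fragment with six statements.
# human_labels: 1 if an annotator marked the statement as core functionality.
# attributions: made-up per-statement importance scores standing in for
# aggregated model attributions (not real SHAP output).
human_labels = [1, 0, 1, 1, 0, 0]
attributions = [0.42, 0.10, 0.05, 0.31, 0.20, 0.02]

rho = spearman(human_labels, attributions)
print(f"Spearman rho = {rho:.3f}")
```

A value near 1 would mean the model's attributions rank statements much as the annotator did; values near 0 correspond to the weak agreement the abstract reports.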

Link to Preprint

http://www.mysmu.edu/faculty/lxjiang/papers/apsec23interpretCodeBERT.pdf

Shamsa Abid
Singapore Management University, Singapore

Xuemeng Cai
Singapore Management University, Singapore

Lingxiao Jiang
Singapore Management University, Singapore

CloneSHAP Interpreter Code, Data and Results

Session Program

Tue 5 Dec
Displayed time zone: Seoul

16:00 - 17:30	AI and Software Engineering (3)
ERA - Early Research Achievements / SEIP - Software Engineering in Practice / Technical Track, at Grand Hall 4
Chair(s): Jaechang Nam (Handong Global University)

16:00 30m Talk	Interpreting CodeBERT for Semantic Code Clone Detection (Technical Track)
	Shamsa Abid (Singapore Management University, Singapore), Xuemeng Cai (Singapore Management University), Lingxiao Jiang (Singapore Management University)
16:30 20m Talk	A Novel Statistical Measure for Out-of-Distribution Detection in Data Quality Assurance (SEIP - Software Engineering in Practice)
	Tinghui Ouyang (National Institute of Informatics, Japan), Isao Echizen (National Institute of Informatics), Yoshiki Seo (National Institute of Advanced Industrial Science and Technology)
16:50 20m Talk	Quality Assurance of A GPT-based Sentiment Analysis System: Adversarial Review Data Generation and Detection (SEIP - Software Engineering in Practice)
	Tinghui Ouyang (National Institute of Informatics, Japan), Hoang-Quoc Nguyen-Son (National Institute of Informatics), Huy H. Nguyen (National Institute of Informatics), Isao Echizen (National Institute of Informatics), Yoshiki Seo (National Institute of Advanced Industrial Science and Technology)
17:10 20m Talk	TLDBERT: Leveraging Further Pre-trained Model for Issue Typed Links Detection (ERA - Early Research Achievements)
	Huaian Zhou (National University of Defense Technology), Tao Wang (National University of Defense Technology), Yang Zhang (National University of Defense Technology, China), Yang Shen (National University of Defense Technology)